Polyfit - Because statistics is hard, and linear regression is made entirely out of footguns
⚓ Rust 📅 2026-01-28 👤 surdeus

I needed to draw a curve fit through some data, and it turned into a year-long rabbit hole. Along the way I discovered that statistics is really involved, and that the Rust ecosystem is a bit barren when it comes to user-friendly, batteries-included polynomial fitting libraries.
So I built Polyfit - Because you don't need to be able to build a power drill to use one safely.
- The full power of polynomial fitting without needing to understand all the math
- Sensible parameters (DegreeBound, scoring metrics, basis functions) that don't feel arbitrary or like magic numbers
- Extensive documentation, examples, and built in testing tools
GitHub | Crates.io | Documentation | Homepage
My goals for the project were:
- Never ask for a number without context: ask for a random number and you get a random number
  - Instead, if I can derive the correct value myself, I do
  - If I can't, I have named presets that describe in detail why you'd pick them
- Provide sensible defaults for everything
  - If you don't care about a setting, you shouldn't have to think about it
  - You should not need to understand the math to get good results
- Performance
  - I tried to prioritize speed and memory efficiency where possible
  - On my fairly average laptop, I can do a 100-million-point fit in ~1s
- You need to be able to test it
  - Not understanding the math shouldn't be a barrier to making sure it works
  - There's a whole test suite included with extensive docs, examples, and sensible defaults
  - The tests even generate a plot on failure so you can see what went wrong
  - I also included a set of random noise injection transforms to help you make synthetic data for testing
  - The tests even show the seeds used on failure, for reproducibility
Here are some examples of why you'd want to use Polyfit.
Oh no! I have all this data and I need to draw a line through it
```rust
use polyfit::{
    score::Aic,
    statistics::DegreeBound,
    ChebyshevFit,
};

let mut fit = ChebyshevFit::new_auto(&data, DegreeBound::Relaxed, &Aic)?;
let equation = fit.as_monomial()?.to_string();
let pretty_line = fit.solve_range(0.0..=100.0, 1.0)?;
```
- Chebyshev fitting is more numerically stable, so it's a good default choice
- `DegreeBound::Relaxed` uses your data to pick a reasonable degree without overfitting
- `Aic` is a scoring metric; smallish datasets tend to do well with it
We use `as_monomial` to get the equation in a human-readable format.
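If you're curious what a scoring metric like AIC actually does, here's a minimal sketch of the idea in plain Rust. This is the textbook formula for a least-squares fit, not Polyfit's internals:

```rust
// Akaike Information Criterion for a least-squares fit:
// AIC = n * ln(RSS / n) + 2k, where n is the number of points,
// RSS the residual sum of squares, and k the parameter count.
// Lower is better: extra parameters must "pay" for themselves.
fn aic(residuals: &[f64], num_params: usize) -> f64 {
    let n = residuals.len() as f64;
    let rss: f64 = residuals.iter().map(|r| r * r).sum();
    n * (rss / n).ln() + 2.0 * num_params as f64
}

fn main() {
    // Identical residuals, but the second "model" used more parameters,
    // so it scores worse despite fitting equally well.
    let residuals = [0.1, -0.2, 0.15, -0.05, 0.1, -0.1];
    let simple = aic(&residuals, 3);
    let complex = aic(&residuals, 6);
    assert!(simple < complex);
    println!("AIC (3 params): {simple:.3}");
    println!("AIC (6 params): {complex:.3}");
}
```

This is why an auto-selected degree doesn't just grow forever: past a point, the `2k` penalty outweighs the shrinking residuals.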
Oh gee willikers! How am I going to figure out which of these data points are outliers?
```rust
// The covariance is the thing that tells us how certain we are about the fit; just roll with it
let covariance = fit.covariance()?;
let outliers = covariance.outliers(Confidence::P95, Some(Tolerance::Absolute(0.1)))?;
```
- The `Confidence` is just a measure of how much you trust the fit; `P95` is a good option
- I added `Tolerance` because real-world data is messy. If I know my sensor is only accurate to ±0.1 units, I shouldn't need to mess with the confidence level to account for that. It's basically an engineering correction on top of `Confidence`
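To make that idea concrete, here's a standalone sketch (not Polyfit's actual implementation) of the rule described above: a point is only flagged if its residual exceeds both the statistical band and the absolute instrument tolerance:

```rust
// Conceptual sketch of combining a confidence band with an absolute
// tolerance. A residual must exceed both the statistical threshold
// (z * sigma, with z ≈ 1.96 for 95% confidence) and the instrument
// tolerance before the point counts as an outlier.
fn outlier_indices(residuals: &[f64], z: f64, tolerance: f64) -> Vec<usize> {
    let n = residuals.len() as f64;
    let mean = residuals.iter().sum::<f64>() / n;
    let var = residuals.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let sigma = var.sqrt();
    let mut out = Vec::new();
    for (i, &r) in residuals.iter().enumerate() {
        if r.abs() > z * sigma && r.abs() > tolerance {
            out.push(i);
        }
    }
    out
}

fn main() {
    // One wild residual among small ones; tolerance 0.1 mirrors a
    // sensor accurate to +/- 0.1 units.
    let residuals = [0.02, -0.03, 0.01, 0.95, -0.02, 0.04];
    let outliers = outlier_indices(&residuals, 1.96, 0.1);
    println!("{outliers:?}"); // index 3 is flagged
}
```

The tolerance check is what keeps tiny-but-statistically-unusual residuals from being flagged when they're within your sensor's known accuracy.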
Polyfit also has extensive calculus support. Say you have weather data with temperature over time:
```rust
use polyfit::{FourierFit, score::Aic, statistics::DegreeBound};

let fit = FourierFit::new_auto(&data, DegreeBound::Relaxed, &Aic)?;

// Derivatives give us rates of change, and critical points are neat for this:
// they tell us when the temperature stops rising or falling and starts doing the opposite
for point in fit.critical_points()? {
    match point {
        CriticalPoint::Minima(x, _y) => println!("Found a local minimum at x = {}", x),
        CriticalPoint::Maxima(x, _y) => println!("Found a local maximum at x = {}", x),
        CriticalPoint::Inflection(x, _y) => println!("Found an inflection point at x = {}", x),
    }
}
```
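For anyone who wants to peek under the hood, classifying a critical point boils down to the sign of the second derivative there. A self-contained sketch in plain Rust (the math idea, not Polyfit's API):

```rust
// p(x) = sum of c[i] * x^i; derivatives are taken coefficient-wise.
fn eval(coeffs: &[f64], x: f64) -> f64 {
    // Horner's method: fold from the highest coefficient down
    coeffs.iter().rev().fold(0.0, |acc, &c| acc * x + c)
}

fn derivative(coeffs: &[f64]) -> Vec<f64> {
    // d/dx of c * x^i is (i * c) * x^(i-1)
    coeffs.iter().enumerate().skip(1).map(|(i, &c)| i as f64 * c).collect()
}

// Classify a point where p'(x) = 0 by the sign of p''(x).
fn classify(coeffs: &[f64], x: f64) -> &'static str {
    let second = derivative(&derivative(coeffs));
    match eval(&second, x) {
        s if s > 0.0 => "minimum",
        s if s < 0.0 => "maximum",
        _ => "possible inflection",
    }
}

fn main() {
    // p(x) = x^2 - 4x + 3 has p'(2) = 0 and p''(2) = 2 > 0: a minimum.
    let p = [3.0, -4.0, 1.0];
    println!("x = 2 is a {}", classify(&p, 2.0));
}
```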
There are too many options! How do I pick a basis for my data?
First read these:
And also call `basis_select!()`.
It tests your data on every basis I support and gives you an easy-to-digest report:
| # | Basis | Params | Score Weight | Fit Quality | Normality | Rating |
|---|-------|--------|--------------|-------------|-----------|--------|
| 1 | Fourier | 9 | 100.00% | 99.00% | 67.80% | 71% ☆☆★★★ |
| 2 | Laguerre | 11 | 0.00% | 69.86% | 0.00% | 33% ☆☆☆☆☆ |
| 3 | Legendre | 11 | 0.00% | 70.91% | 0.00% | 34% ☆☆☆☆☆ |
| 4 | Chebyshev | 11 | 0.00% | 70.91% | 0.00% | 34% ☆☆☆☆☆ |
| 5 | Logarithmic | 11 | 0.00% | 68.17% | 0.00% | 33% ☆☆☆☆☆ |
| 6 | Probabilists' Hermite | 7 | 0.00% | 66.04% | 0.00% | 50% ☆☆☆☆★ |
| 7 | Physicists' Hermite | 10 | 0.00% | 68.88% | 0.00% | 36% ☆☆☆☆☆ |
[ How to interpret the results ]
[ Results may be misleading for small datasets (<100 points) ]
- Score Weight: Relative likelihood of being the best model among the options tested, based on the scoring method used.
- Fit Quality: Proportion of variance in the data explained by the model (uses Huber-loss-weighted R²).
- Normality: How closely the residuals follow a normal distribution (useless for small datasets).
- Rating: Combined score (0.75 * Fit Quality + 0.25 * Normality) to give an overall quality measure.
- Stars: A simple star rating out of 5 based on the Rating score. Not scientific.
- The best 3 models are shown below with their equations and plots (if enabled).
- Fewer parameters mean a simpler model, which is better
- Better fit quality means it explains more of the data
- Better normality means it's probably not underfitting (too simple)
- The rating is a weighted combination of fit quality and normality to give an overall score
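The stated weighting is easy to sanity-check yourself. A one-liner sketch with illustrative inputs (not taken from the table above, since the actual report may fold in further adjustments):

```rust
// The Rating described above: a fixed 75/25 weighting of
// fit quality and residual normality, both in [0, 1].
fn rating(fit_quality: f64, normality: f64) -> f64 {
    0.75 * fit_quality + 0.25 * normality
}

fn main() {
    // e.g. 80% fit quality with 60% normality
    let r = rating(0.8, 0.6);
    println!("Rating: {:.0}%", r * 100.0); // 75%
}
```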