Polyfit - Because statistics is hard, and linear regression is made entirely out of footguns

⚓ Rust    📅 2026-01-28    👤 surdeus    👁️ 1      

I needed to draw a curve fit through some data, and it turned into a year-long rabbit hole, where I discovered that stats is really involved and that the Rust ecosystem is a bit barren in terms of user-friendly, batteries-included polynomial fitting libraries.

So I built Polyfit - because you don't need to be able to build a power drill to use one safely.

  • The full power of polynomial fitting without needing to understand all the math
  • Sensible parameters (DegreeBound, scoring metrics, basis functions) that don't feel arbitrary or like magic numbers
  • Extensive documentation, examples, and built-in testing tools

GitHub | Crates.io | Documentation | Homepage

My goals for the project were:

  • Never ask for a number without context - ask for a random number and you get a random number
    • Instead, if I can derive the correct value myself I do
    • If I can't, I have named presets that describe in detail why you'd pick them
  • Provide sensible defaults for everything
    • If you don't care about a setting, you shouldn't have to think about it
    • You should not need to understand the math to get good results
  • Performance
    • I tried to prioritize speed and memory efficiency where possible
    • On my fairly average laptop, I can do a 100 million point fit in ~1s
  • You need to be able to test it
    • Not understanding the math shouldn't be a barrier to making sure it works
    • There's a whole test suite included with extensive docs, examples, and sensible defaults
    • The tests even generate a plot on failure so you can see what went wrong
    • And I included a set of random noise injection transforms to help you make synthetic data for testing
    • The tests will even show seeds used on failure for reproducibility

Here are some examples of why you'd want to use Polyfit:


Oh no! I have all this data and I need to draw a line through it

use polyfit::{
    score::Aic,
    statistics::DegreeBound,
    ChebyshevFit,
};

let mut fit = ChebyshevFit::new_auto(&data, DegreeBound::Relaxed, &Aic)?;
let equation = fit.as_monomial()?.to_string();
let pretty_line = fit.solve_range(0.0..=100.0, 1.0)?;

  • Chebyshev fitting is more numerically stable, so it's a good default choice
  • DegreeBound::Relaxed uses your data to pick a reasonable degree without overfitting
  • Aic is a scoring metric. Smallish datasets tend to do well with it

We use as_monomial to get the equation in a human-readable format.
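For intuition, here is a rough std-only sketch of what a Chebyshev-to-monomial conversion has to do. It is not Polyfit's actual code: the function name `chebyshev_to_monomial` is made up for illustration, and it ignores the domain rescaling a real fit would apply to your x values. It only relies on the standard recurrence T_0 = 1, T_1 = x, T_k = 2x·T_(k-1) − T_(k-2).

```rust
/// Convert Chebyshev-series coefficients (c[0]*T_0 + c[1]*T_1 + ...) into
/// monomial coefficients (a[0] + a[1]*x + a[2]*x^2 + ...).
/// Hypothetical helper for illustration; not part of the polyfit crate.
fn chebyshev_to_monomial(cheb: &[f64]) -> Vec<f64> {
    let n = cheb.len();
    let mut out = vec![0.0; n];
    if n == 0 {
        return out;
    }

    // Monomial coefficients of T_0 = 1.
    let mut prev = vec![0.0; n];
    prev[0] = 1.0;
    out[0] += cheb[0];
    if n == 1 {
        return out;
    }

    // Monomial coefficients of T_1 = x.
    let mut curr = vec![0.0; n];
    curr[1] = 1.0;
    out[1] += cheb[1];

    for k in 2..n {
        // T_k = 2x * T_{k-1} - T_{k-2}: multiplying by x shifts coefficients up by one.
        let mut next = vec![0.0; n];
        for j in 0..n {
            let shifted = if j > 0 { 2.0 * curr[j - 1] } else { 0.0 };
            next[j] = shifted - prev[j];
        }
        for j in 0..n {
            out[j] += cheb[k] * next[j];
        }
        prev = curr;
        curr = next;
    }
    out
}
```

The monomial form is easier to read but tends to be worse conditioned at high degree, which is why it makes sense to fit in the Chebyshev basis and only convert for display.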


Oh gee willikers! How am I going to figure out which of these data points are outliers?

let covariance = fit.covariance()?; // Tells us how certain we are about the fit; just roll with it
let outliers = covariance.outliers(Confidence::P95, Some(Tolerance::Absolute(0.1)))?;

  • The Confidence is just a measure of how much you trust the fit. P95 is a good option
  • I added Tolerance because real-world data is messy. If I know my sensor is only accurate to +/- 0.1 units, I shouldn't need to mess with the confidence level to account for that. It's basically an engineering correction for Confidence
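To make the Confidence plus Tolerance idea concrete, here is a minimal std-only sketch. How Polyfit actually combines the two is not stated in this post, so treat this as one plausible reading: a point is flagged when its residual is larger than `z` standard deviations of the residuals plus an absolute tolerance. The function name `flag_outliers` and the use of z ≈ 1.96 for a 95% band are assumptions for illustration.

```rust
/// Return the indices of points whose residual magnitude exceeds
/// `z * stddev(residuals) + tolerance`.
/// Hypothetical helper for illustration; not part of the polyfit crate.
fn flag_outliers(observed: &[f64], predicted: &[f64], z: f64, tolerance: f64) -> Vec<usize> {
    // Residual = observed value minus the value the fit predicts.
    let residuals: Vec<f64> = observed
        .iter()
        .zip(predicted)
        .map(|(y, p)| y - p)
        .collect();
    let n = residuals.len();
    if n < 2 {
        return Vec::new();
    }

    // Sample standard deviation of the residuals.
    let mean = residuals.iter().sum::<f64>() / n as f64;
    let variance = residuals.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / (n as f64 - 1.0);

    // The tolerance widens the statistical band by a fixed, known measurement error.
    let band = z * variance.sqrt() + tolerance;

    residuals
        .iter()
        .enumerate()
        .filter(|(_, r)| r.abs() > band)
        .map(|(i, _)| i)
        .collect()
}
```

With a tolerance of 0.1, residuals that stay within sensor accuracy are never flagged, however tight the statistical band gets on very clean data.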

I also have extensive calculus support. Say you have weather data with temperature over time:

use polyfit::{FourierFit, score::Aic, statistics::DegreeBound};
let fit = FourierFit::new_auto(&data, DegreeBound::Relaxed, &Aic)?;

// Derivatives for rates of change
// Critical points are neat for this
// This tells us when the temperature stops rising or falling and starts doing the opposite
for point in fit.critical_points()? {
    match point {
        CriticalPoint::Minima(x, _y_) => println!("Found a local minimum at x = {}", x),
        CriticalPoint::Maxima(x, _y_) => println!("Found a local maximum at x = {}", x),
        CriticalPoint::Inflection(x, _y_) => println!("Found an inflection point at x = {}", x),
    }
}
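If you're curious what finding critical points involves, here is a rough std-only sketch for a plain monomial polynomial (not a Fourier fit, and not Polyfit's implementation). It differentiates the coefficients, samples the derivative across a range, and bisects wherever the derivative changes sign. The names `eval`, `derivative`, `extrema`, and `Kind` are made up for illustration, and inflection points are left out to keep it short.

```rust
/// Evaluate a polynomial with coefficients in ascending order (Horner's method).
fn eval(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().rev().fold(0.0, |acc, c| acc * x + c)
}

/// Coefficients of the derivative: d/dx (c_i * x^i) = i * c_i * x^(i-1).
fn derivative(coeffs: &[f64]) -> Vec<f64> {
    coeffs
        .iter()
        .enumerate()
        .skip(1)
        .map(|(i, c)| c * i as f64)
        .collect()
}

#[derive(Debug, PartialEq)]
enum Kind {
    Minimum,
    Maximum,
}

/// Find local extrema on [lo, hi] by sampling the derivative at `steps + 1`
/// points and bisecting between samples where its sign flips.
fn extrema(coeffs: &[f64], lo: f64, hi: f64, steps: usize) -> Vec<(Kind, f64)> {
    let d = derivative(coeffs);
    let steps = steps.max(1);
    let mut found = Vec::new();
    // Last sample where the derivative was nonzero: (x, f'(x)).
    let mut last: Option<(f64, f64)> = None;

    for i in 0..=steps {
        let x = lo + (hi - lo) * i as f64 / steps as f64;
        let fx = eval(&d, x);
        if fx == 0.0 {
            // Exactly on a root; the next nonzero sample decides if the sign flipped.
            continue;
        }
        if let Some((px, pf)) = last {
            if (pf < 0.0) != (fx < 0.0) {
                // Bisect, keeping `a` on the side whose sign matches `pf`.
                let (mut a, mut b) = (px, x);
                for _ in 0..60 {
                    let m = 0.5 * (a + b);
                    let fm = eval(&d, m);
                    if fm != 0.0 && (fm < 0.0) == (pf < 0.0) {
                        a = m;
                    } else {
                        b = m;
                    }
                }
                // Slope going from negative to positive means a minimum.
                let kind = if pf < 0.0 { Kind::Minimum } else { Kind::Maximum };
                found.push((kind, 0.5 * (a + b)));
            }
        }
        last = Some((x, fx));
    }
    found
}
```

For example, `extrema(&[0.0, -3.0, 0.0, 1.0], -2.0, 2.0, 40)` looks at x³ − 3x and should report a maximum near x = −1 and a minimum near x = 1. Extrema closer together than one sample step can be missed, so the step count matters.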

There are too many options! How do I pick a basis for my data?

First read these:

And also call basis_select!()

It tests your data on every basis I support and gives you an easy to digest report:

  |             Basis              | Params | Score Weight | Fit Quality | Normality | Rating
--|--------------------------------|--------|--------------|-------------|-----------|-----------
1 |                        Fourier |      9 |      100.00% |      99.00% |    67.80% | 71% ☆☆★★★
2 |                       Laguerre |     11 |        0.00% |      69.86% |     0.00% | 33% ☆☆☆☆☆
3 |                       Legendre |     11 |        0.00% |      70.91% |     0.00% | 34% ☆☆☆☆☆
--|--------------------------------|--------|--------------|-------------|-----------|-----------
4 |                      Chebyshev |     11 |        0.00% |      70.91% |     0.00% | 34% ☆☆☆☆☆
5 |                    Logarithmic |     11 |        0.00% |      68.17% |     0.00% | 33% ☆☆☆☆☆
6 |          Probabilists' Hermite |      7 |        0.00% |      66.04% |     0.00% | 50% ☆☆☆☆★
7 |            Physicists' Hermite |     10 |        0.00% |      68.88% |     0.00% | 36% ☆☆☆☆☆

[ How to interpret the results ]
[ Results may be misleading for small datasets (<100 points) ]
 - Score Weight: Relative likelihood of being the best model among the options tested, based on the scoring method used.
 - Fit Quality: Proportion of variance in the data explained by the model (uses huber loss weighted r2).
 - Normality: How closely the residuals follow a normal distribution (useless for small datasets).
 - Rating: Combined score (0.75 * Fit Quality + 0.25 * Normality) to give an overall quality measure.
 - Stars: A simple star rating out of 5 based on the Rating score. Not scientific.
 - The best 3 models are shown below with their equations and plots (if enabled).

  • Fewer params means a simpler model, which is better
  • Better fit quality means it explains more of the data
  • Better normality means it's probably not underfitting (too simple)
  • The rating is a weighted combination of fit quality and normality to give an overall score
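The "Score Weight" column reads like a relative likelihood across the candidate models. One standard way to get that from AIC scores is Akaike weights. The sketch below is std-only and is an assumption about the general technique, not a copy of what Polyfit does internally; the function names `aic` and `akaike_weights` are made up for illustration.

```rust
/// AIC for a least-squares fit, up to an additive constant:
/// n * ln(RSS / n) + 2k, where k is the number of fitted parameters.
fn aic(rss: f64, n: usize, k: usize) -> f64 {
    n as f64 * (rss / n as f64).ln() + 2.0 * k as f64
}

/// Turn information-criterion scores (lower is better) into weights that sum to 1.
/// Each weight is exp(-delta / 2), where delta is the gap to the best score.
fn akaike_weights(scores: &[f64]) -> Vec<f64> {
    let best = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    let raw: Vec<f64> = scores.iter().map(|s| (-(s - best) / 2.0).exp()).collect();
    let total: f64 = raw.iter().sum();
    raw.iter().map(|r| r / total).collect()
}
```

Because the weights fall off exponentially with the score gap, one clearly better model takes nearly all of the weight, which would explain a column that reads 100% for one basis and 0% for the rest.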
