Polyfit - Because statistics is hard, and linear regression is made entirely out of footguns

⚓ Rust 📅 2026-01-28 👤 surdeus 👁️ 1

Info

This post is auto-generated from RSS feed The Rust Programming Language Forum - Latest topics. Source: Polyfit - Because statistics is hard, and linear regression is made entirely out of footguns

I needed to draw a curve fit through some data, and it turned into a year-long rabbit hole, where I discovered that stats is really involved and that the Rust ecosystem is a bit barren in terms of user-friendly, batteries-included polynomial fitting libraries.

So I built Polyfit - Because you don't need to be able to build a power drill to use one safely.

- The full power of polynomial fitting without needing to understand all the math
- Sensible parameters (DegreeBound, scoring metrics, basis functions) that don't feel arbitrary or like magic numbers
- Extensive documentation, examples, and built-in testing tools

GitHub | Crates.io | Documentation | Homepage

My goals for the project were:

- Never ask for a number without context - ask for a random number and you get a random number
  - Instead, if I can derive the correct value myself I do
  - If I can't, I have named presets that describe in detail why you'd pick them
- Provide sensible defaults for everything
  - If you don't care about a setting, you shouldn't have to think about it
  - You should not need to understand the math to get good results
- Performance
  - I tried to prioritize speed and memory efficiency where possible
  - On my fairly average laptop, I can do a 100 million point fit in ~1s
- You need to be able to test it
  - Not understanding the math shouldn't be a barrier to making sure it works
  - There's a whole test suite included with extensive docs, examples, and sensible defaults
  - The tests even generate a plot on failure so you can see what went wrong
  - And I included a set of random noise injection transforms to help you make synthetic data for testing
  - The tests will even show seeds used on failure for reproducibility

Here are some examples of why you'd want to use Polyfit:

Oh no! I have all this data and I need to draw a line through it

use polyfit::{
    score::Aic,
    statistics::DegreeBound,
    ChebyshevFit,
};

let mut fit = ChebyshevFit::new_auto(&data, DegreeBound::Relaxed, &Aic)?;
let equation = fit.as_monomial()?.to_string();
let pretty_line = fit.solve_range(0.0..=100.0, 1.0)?;

Chebyshev fitting is more numerically stable so it's a good default choice
DegreeBound::Relaxed uses your data to pick a reasonable degree without overfitting
Aic is a scoring metric. Smallish datasets tend to do well with it

We use as_monomial to get the equation in a human readable format.
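
To give a feel for what automatic degree selection with a scoring metric involves, here is a minimal, self-contained sketch. It is not Polyfit's implementation: it fits plain monomial polynomials through the normal equations (less numerically stable than a Chebyshev basis) and scores each candidate degree with the common least-squares form of AIC, n * ln(RSS / n) + 2k. The names `solve`, `fit_poly`, `eval`, `aic`, and `best_degree` are hypothetical helpers made up for this illustration.

```rust
/// Solve the square system `a * x = b` by Gaussian elimination with partial pivoting.
fn solve(mut a: Vec<Vec<f64>>, mut b: Vec<f64>) -> Option<Vec<f64>> {
    let n = b.len();
    for col in 0..n {
        // Use the row with the largest pivot to limit round-off.
        let pivot = (col..n).max_by(|&i, &j| a[i][col].abs().total_cmp(&a[j][col].abs()))?;
        if a[pivot][col].abs() < 1e-12 {
            return None; // effectively singular
        }
        a.swap(col, pivot);
        b.swap(col, pivot);
        for row in (col + 1)..n {
            let factor = a[row][col] / a[col][col];
            for k in col..n {
                let delta = factor * a[col][k];
                a[row][k] -= delta;
            }
            let delta = factor * b[col];
            b[row] -= delta;
        }
    }
    // Back substitution.
    let mut x = vec![0.0; n];
    for row in (0..n).rev() {
        let tail: f64 = ((row + 1)..n).map(|k| a[row][k] * x[k]).sum();
        x[row] = (b[row] - tail) / a[row][row];
    }
    Some(x)
}

/// Least-squares polynomial fit; coefficients are returned lowest power first.
fn fit_poly(xs: &[f64], ys: &[f64], degree: usize) -> Option<Vec<f64>> {
    let m = degree + 1;
    let mut ata = vec![vec![0.0; m]; m];
    let mut aty = vec![0.0; m];
    for (&x, &y) in xs.iter().zip(ys) {
        let powers: Vec<f64> = (0..m).map(|p| x.powi(p as i32)).collect();
        for i in 0..m {
            aty[i] += powers[i] * y;
            for j in 0..m {
                ata[i][j] += powers[i] * powers[j];
            }
        }
    }
    solve(ata, aty)
}

/// Evaluate the polynomial with Horner's rule.
fn eval(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().rev().fold(0.0, |acc, &c| acc * x + c)
}

/// AIC for a least-squares fit: n * ln(RSS / n) + 2k. Lower is better.
fn aic(xs: &[f64], ys: &[f64], coeffs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let rss: f64 = xs.iter().zip(ys).map(|(&x, &y)| (y - eval(coeffs, x)).powi(2)).sum();
    n * (rss / n).max(f64::MIN_POSITIVE).ln() + 2.0 * coeffs.len() as f64
}

/// Try every degree up to `max_degree` and keep the one with the lowest AIC.
fn best_degree(xs: &[f64], ys: &[f64], max_degree: usize) -> Option<usize> {
    (0..=max_degree)
        .filter_map(|d| fit_poly(xs, ys, d).map(|c| (d, aic(xs, ys, &c))))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(d, _)| d)
}
```

The 2k term is what keeps the search from always picking the highest degree: an extra parameter has to reduce the residual error enough to pay for itself.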

Oh gee willikers! How am I going to figure out which of these data points are outliers?

let covariance = fit.covariance()?; // Covariance tells us how certain we are about the fit; just roll with it
let outliers = covariance.outliers(Confidence::P95, Some(Tolerance::Absolute(0.1)))?;

The Confidence is just a measure of how much you trust the fit. P95 is a good option
I added Tolerance because real world data is messy. If I know my sensor is only accurate to +/- 0.1 units I shouldn't need to mess with the confidence level to account for that. It's basically an engineering correction for Confidence
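
As a rough mental model of how a confidence level and an absolute tolerance can combine, here is a simplified sketch. It is an assumption for illustration, not how Polyfit computes outliers: it flags a point when its residual exceeds z standard deviations of the residuals plus a fixed tolerance, where z = 1.96 corresponds to a two-sided 95% interval under a normality assumption. `flag_outliers` and its parameters are hypothetical names.

```rust
/// Return the indices of points whose residual lies outside `z * sigma + tolerance`.
/// `sigma` is the sample standard deviation of the residuals.
fn flag_outliers(observed: &[f64], predicted: &[f64], z: f64, tolerance: f64) -> Vec<usize> {
    let residuals: Vec<f64> = observed.iter().zip(predicted).map(|(o, p)| o - p).collect();
    if residuals.len() < 2 {
        return Vec::new(); // not enough data to estimate a spread
    }
    let n = residuals.len() as f64;
    let mean = residuals.iter().sum::<f64>() / n;
    let variance = residuals.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let threshold = z * variance.sqrt() + tolerance;
    residuals
        .iter()
        .enumerate()
        .filter(|(_, r)| r.abs() > threshold)
        .map(|(i, _)| i)
        .collect()
}
```

The tolerance simply widens the band by a fixed amount, so a sensor known to be accurate to +/- 0.1 units does not force you to change the confidence level.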

I also have extensive calculus support.

Say you have weather data with temperature over time:


use polyfit::{FourierFit, score::Aic, statistics::DegreeBound};
let fit = FourierFit::new_auto(&data, DegreeBound::Relaxed, &Aic)?;

// Derivatives for rates of change
// Critical points are neat for this
// This tells us when the temperature stops rising or falling and starts doing the opposite
for point in fit.critical_points()? {
    match point {
        CriticalPoint::Minima(x, _y) => println!("Found a local minimum at x = {}", x),
        CriticalPoint::Maxima(x, _y) => println!("Found a local maximum at x = {}", x),
        CriticalPoint::Inflection(x, _y) => println!("Found an inflection point at x = {}", x),
    }
}
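
For intuition about what finding critical points involves, here is a small standalone sketch for an ordinary polynomial. It is not Polyfit's algorithm and does not use its types: it differentiates the coefficients, scans a grid for sign changes in the derivative, and refines each one by bisection. `Extremum`, `derivative`, and `extrema` are hypothetical names, and this simple scan only finds points where the slope actually changes sign, so it reports minima and maxima but not stationary inflection points.

```rust
#[derive(Debug, PartialEq)]
enum Extremum {
    Minimum(f64),
    Maximum(f64),
}

/// Evaluate a polynomial (coefficients lowest power first) with Horner's rule.
fn eval(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().rev().fold(0.0, |acc, &c| acc * x + c)
}

/// Coefficients of the first derivative.
fn derivative(coeffs: &[f64]) -> Vec<f64> {
    coeffs.iter().enumerate().skip(1).map(|(i, &c)| c * i as f64).collect()
}

/// Scan `[lo, hi]` in `steps` intervals and locate sign changes of the slope.
fn extrema(coeffs: &[f64], lo: f64, hi: f64, steps: usize) -> Vec<Extremum> {
    let d1 = derivative(coeffs);
    let mut found = Vec::new();
    // Last grid point where the slope was non-zero, with its slope.
    let mut prev: Option<(f64, f64)> = None;
    for i in 0..=steps {
        let x = lo + (hi - lo) * i as f64 / steps as f64;
        let slope = eval(&d1, x);
        if slope == 0.0 {
            continue; // exactly on a root; the neighbours decide the sign change
        }
        if let Some((px, pslope)) = prev {
            if pslope * slope < 0.0 {
                // Bisection: keep the side whose slope sign matches `pslope` in `a`.
                let (mut a, mut b) = (px, x);
                for _ in 0..60 {
                    let mid = 0.5 * (a + b);
                    if eval(&d1, mid) * pslope > 0.0 {
                        a = mid;
                    } else {
                        b = mid;
                    }
                }
                let root = 0.5 * (a + b);
                // Falling then rising is a minimum; rising then falling is a maximum.
                found.push(if pslope < 0.0 {
                    Extremum::Minimum(root)
                } else {
                    Extremum::Maximum(root)
                });
            }
        }
        prev = Some((x, slope));
    }
    found
}
```

For x^3 - 3x, passed as the coefficients `[0.0, -3.0, 0.0, 1.0]`, the derivative is 3x^2 - 3, so a scan over -3 to 3 should report a maximum near x = -1 and a minimum near x = 1.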

There are too many options! How do I pick a basis for my data?

First read these:

And also call basis_select!()

It tests your data on every basis I support and gives you an easy to digest report:

  |             Basis              | Params | Score Weight | Fit Quality | Normality | Rating
--|--------------------------------|--------|--------------|-------------|-----------|-----------
1 |                        Fourier |      9 |      100.00% |      99.00% |    67.80% | 71% ☆☆★★★
2 |                       Laguerre |     11 |        0.00% |      69.86% |     0.00% | 33% ☆☆☆☆☆
3 |                       Legendre |     11 |        0.00% |      70.91% |     0.00% | 34% ☆☆☆☆☆
--|--------------------------------|--------|--------------|-------------|-----------|-----------
4 |                      Chebyshev |     11 |        0.00% |      70.91% |     0.00% | 34% ☆☆☆☆☆
5 |                    Logarithmic |     11 |        0.00% |      68.17% |     0.00% | 33% ☆☆☆☆☆
6 |          Probabilists' Hermite |      7 |        0.00% |      66.04% |     0.00% | 50% ☆☆☆☆★
7 |            Physicists' Hermite |     10 |        0.00% |      68.88% |     0.00% | 36% ☆☆☆☆☆

[ How to interpret the results ]
[ Results may be misleading for small datasets (<100 points) ]
 - Score Weight: Relative likelihood of being the best model among the options tested, based on the scoring method used.
 - Fit Quality: Proportion of variance in the data explained by the model (uses huber loss weighted r2).
 - Normality: How closely the residuals follow a normal distribution (useless for small datasets).
 - Rating: Combined score (0.75 * Fit Quality + 0.25 * Normality) to give an overall quality measure.
 - Stars: A simple star rating out of 5 based on the Rating score. Not scientific.
 - The best 3 models are shown below with their equations and plots (if enabled).

Fewer params means a simpler model, which is better
Better fit quality means it explains more of the data
Better normality means it's probably not underfitting (too simple)
The rating is a weighted combination of fit quality and normality to give an overall score
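
The description of Score Weight as a relative likelihood of being the best model matches the standard idea of Akaike weights, so here is a sketch of that calculation. Whether the report computes its column exactly this way is an assumption; `akaike_weights` is a hypothetical helper. Each model gets exp(-delta / 2), where delta is its score minus the best (lowest) score, and the results are normalized to sum to 1.

```rust
/// Turn information-criterion scores (lower is better) into relative weights.
fn akaike_weights(scores: &[f64]) -> Vec<f64> {
    let best = scores.iter().copied().fold(f64::INFINITY, f64::min);
    // Relative likelihood of each model compared with the best one.
    let relative: Vec<f64> = scores.iter().map(|s| (-(s - best) / 2.0).exp()).collect();
    let total: f64 = relative.iter().sum();
    relative.iter().map(|r| r / total).collect()
}
```

Because the weights decay exponentially with the score gap, one clearly better model can end up with nearly 100% while the rest round to 0%, which is the pattern in the report above.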
