The best part is: you can start using our methods today by pip installing our open source package probmetrics.
📄 Read the paper: arxiv.org/abs/2511.03685
👩‍💻 Calibrate your models with probmetrics: github.com/dholzmueller...
Still using temperature scaling?
With @dholzmueller.bsky.social, Michael I. Jordan, and @bachfrancis.bsky.social, we argue that, with well-designed regularization, more expressive models like matrix scaling can outperform simpler ones across calibration set sizes, data dimensions, and applications.
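For a concrete picture of the two post-hoc calibrators being compared, here is a minimal toy sketch. This is my own illustration, not probmetrics' API or the paper's regularization scheme: matrix scaling is approximated by plain L2-regularized multinomial logistic regression on the logits, and all function names are mine.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax
from sklearn.linear_model import LogisticRegression

def nll(probs, y):
    """Mean negative log-likelihood (cross-entropy) of predicted probabilities."""
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def fit_temperature(logits, y):
    """Temperature scaling: a single scalar T chosen to minimize validation NLL."""
    obj = lambda t: nll(softmax(logits / t, axis=1), y)
    return minimize_scalar(obj, bounds=(0.05, 20.0), method="bounded").x

def fit_matrix_scaling(logits, y, C=1.0):
    """Matrix scaling: an affine map on logits, fit here as L2-regularized
    multinomial logistic regression (C controls regularization strength)."""
    return LogisticRegression(C=C, max_iter=1000).fit(logits, y)

# Toy demo: overconfident logits (true logits blown up by a factor of 3).
rng = np.random.default_rng(0)
z = rng.normal(size=(4000, 3))
y = np.array([rng.choice(3, p=p) for p in softmax(z, axis=1)])
overconfident = 3.0 * z

T = fit_temperature(overconfident, y)
raw_nll = nll(softmax(overconfident, axis=1), y)
calibrated_nll = nll(softmax(overconfident / T, axis=1), y)
# T should land near 3, and the calibrated NLL should be lower than the raw one.
```

Matrix scaling fits a full matrix and intercept instead of one scalar, which is exactly why regularization matters for it: more parameters means more room to overfit a small calibration set.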
COLT Workshop on Predictions and Uncertainty was a banger!
I was lucky to present our paper "Minimum Volume Conformal Sets for Multivariate Regression", alongside my colleague @eberta.bsky.social and his awesome work on calibration.
Big thanks to the organizers!
#ConformalPrediction #MarcoPolo
What if we have been doing early stopping wrong all along?
When you break the validation loss into two terms, calibration and refinement,
you can place the simplest (and cheapest) trick for stopping training at a smarter point: stop when refinement, not the full loss, bottoms out.
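The decomposition above can be sketched as follows. This is a minimal illustration, not the paper's implementation: I'm assuming refinement is estimated as the validation NLL remaining after temperature scaling (as mentioned later in the thread), and the function names are mine.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def nll(probs, y):
    """Mean negative log-likelihood of predicted probabilities."""
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def split_validation_loss(logits, y):
    """Split validation NLL into calibration + refinement, estimating the
    refinement term as the NLL left after optimal temperature scaling."""
    raw = nll(softmax(logits, axis=1), y)
    obj = lambda t: nll(softmax(logits / t, axis=1), y)
    t = minimize_scalar(obj, bounds=(0.05, 20.0), method="bounded").x
    refinement = obj(t)
    calibration = raw - refinement  # >= 0 up to optimizer tolerance
    return calibration, refinement

# An early-stopping loop would then track `refinement` per epoch instead of
# the raw validation loss: miscalibration is cheap to fix post hoc, but lost
# refinement is not.
```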
This suggests a clear link with the ROC curve in the binary case, but writing it down formally, the relationship between the two is a bit ugly…
Isotonic regression minimizes the risk of any « Bregman loss function » (including cross-entropy, see Section 2.1 below) up to monotonic relabeling, which looks a lot like our « refinement as a minimizer » formulation. It also finds the ROC convex hull.
proceedings.mlr.press/v238/berta24...
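To see the monotone-relabeling point concretely, here is a toy binary sketch using scikit-learn's IsotonicRegression; this is my own example, unrelated to the linked paper's code.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy binary example: miscalibrated scores recalibrated by isotonic regression.
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, size=3000)                    # raw model scores
y = (rng.uniform(size=3000) < s ** 2).astype(int)   # true P(y=1|s) = s^2, so s is miscalibrated

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(s, y)  # monotone, piecewise-constant remapping

# Because the remapping is monotone, the ranking of examples is preserved;
# isotonic regression only relabels scores, improving every Bregman loss
# simultaneously (the pool-adjacent-violators solution is loss-agnostic).
```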
However, for calibration of the final model, adding an intercept or doing matrix scaling might work even better in certain scenarios (imbalanced or non-centered data). We've experimented with existing implementations with limited success for now; maybe we should look at that in more detail…
Not yet! Vector/matrix scaling has more parameters so it is more prone to overfitting the validation set, and simple TS seems to calibrate well empirically, which is why we stuck with that to estimate refinement error for early stopping.
I've observed refinement being minimized before calibration for small (probably under-fitted) neural nets. In many cases, the refinement curve also starts « overfitting » at some point.
We've not tried what you're suggesting, but if the training cost is small this might indeed be a good option!
Indeed, regularisation seems very important: it can have a large impact on how calibration error behaves. Combined with learning-rate schedulers, this can have surprising effects, like calibration error starting to go down again at some point.
Thanks! We have experimented with many models, observing various behaviours. The « calibration going up while refinement goes down » pattern seems typical in deep learning from what I've seen. With smaller models other things can appear, as suggested by our logistic regression analysis (Section 6).
📄 Read the full paper: arxiv.org/abs/2501.19195
💻 Check out our code: github.com/dholzmueller...
Early stopping on validation loss? This leads to suboptimal calibration and refinement errors, but you can do better!
With @dholzmueller.bsky.social, Michael I. Jordan, and @bachfrancis.bsky.social, we propose a method that integrates with any model and boosts classification performance across tasks.