You can contact us either here or on our Discord server: discord.gg/ABaPnm7fDC
@skrub-data
skrub is a Python library to ease preprocessing and feature engineering for tabular machine learning. Our long-term goal is to directly connect database tables to machine learning estimators. https://skrub-data.org https://discord.gg/ABaPnm7fDC
In addition, we will begin crediting specific contributors here on Bluesky when a contributor has worked on the subject of the post. We will use GitHub handles for this purpose. If you prefer your handle not to be used or would like to be credited by name instead, please let us know.
As a follow-up, we would like to clarify how we'll be crediting contributors moving forward.
Currently, all contributions to the repository are tracked in the changelog and highlighted in the release notes, where each PR and the GitHub handle of its author are listed.
Thanks to e-strauss for writing this example!
While skrub Data Ops shine when preparing dataframes, their capabilities extend beyond that. For example, they can be used alongside libraries like PyTorch and skorch to work with images, and tune the model size to find the best set of hyperparameters:
skrub-data.org/stable/auto_...
- A new example has been added to show how skrub Data Ops can be used with PyTorch and skorch to solve an image classification task.
skrub-data.org/stable/auto_...
Main changes:
- The StringEncoder now exposes the vocabulary parameter, allowing it to be passed to the underlying TfidfVectorizer.
- The function compute_ngram_distance has been made private to reduce clutter.
- The repository wheel has been made smaller by removing some benchmarking material.
✨ skrub version 0.7.2 has been released ✨
In this release we squashed more bugs, improved the API reference, and added a new example.
github.com/skrub-data/s...
Here is a full example on how to use skrub Data Ops with Optuna
skrub-data.org/stable/auto_...
At the end, you get a fully-fledged Optuna study to work with. Of course, that includes support for the Optuna dashboard and access to the Optuna reporting and plotting interfaces.
Three snippets of Python code showing how to use skrub Data Ops with the Optuna optimization library. The first snippet shows a standard randomized search with the Data Ops. The second snippet adds the parameter "backend", which is set to "optuna". The third snippet uses the Optuna visualization API to plot information from the study.
Did you know that the skrub Data Ops support Optuna as a backend to run hyperparameter search?
It's as easy as writing "backend='optuna'": this will set up a default Optuna study (and the TPE sampler) to replace the standard random sampler.
Happy new year!
Let's celebrate 2026 with a bugfix release that implements some fixes, brings some documentation improvements and adds a new dataset fetcher:
github.com/skrub-data/s...
The course covers:
- How to explore and sanitize data with skrub
- How to use the skrub transformers for powerful and reliable feature engineering
- How to put everything together in a machine learning pipeline
Skrub Data Ops are not included (yet).
Do you want to learn how to use skrub like a pro? Then you're in luck!
Inria Academy is providing an introductory course on skrub aimed at IT personnel, engineers, data scientists, and data analysts.
www.inria-academy.fr/formation/sk...
The recording of the talk we did at @pydataparis.bsky.social 2025 is now available on the PyData YouTube channel!
You can find it here, if you want to check it out:
www.youtube.com/watch?v=k9MN...
Skrub 0.7.0 is here!
✨ Main highlights:
- Tune hyperparameter choices with Optuna
- Added support for Pandas 3.0
- Estimators in data ops can now take additional kwargs
16 new contributors helped with this release 👥
Check out the full changelog: github.com/skrub-data/s...
@skrub-data.bsky.social: better data-science primitives for clean code on dataframes
Watch my dotAI talk, it's fun (live coding)!
www.youtube.com/watch?v=bQS4...
skrub really makes it easy to do machine learning with dataframes
For even more control over column selection, skrub provides a collection of selectors that let you partition dataframes by data type, column name, or user-specified functions.
All these transformers can be chained in a scikit-learn pipeline to build a feature matrix with complex column selection operations, and can be seen as an alternative to the scikit-learn ColumnTransformer.
ApplyToFrame selects columns in the same way, but then uses all of them at the same time as input to the transformer: this is useful for dimensionality reduction.
SelectCols and DropCols can be used as "filtering blocks" in a pipeline.
Skrub includes a powerful set of transformers and selectors that let you transform columns based on various conditions.
ApplyToCols lets you select a subset of columns in your dataframe, then applies a transformer to each selected column separately.
Have we already told you that Skrub is cool? And that @riccardocappuzzo.com's talk was really great? Well, have we?
skrub-data.org/skrub-materi...
Thanks to @riccardocappuzzo.com, @glemaitre58.bsky.social and Jérôme Dockès for preparing the talk, and mentoring at the sprint!
The sprint was also a big hit, with both new and old contributors working on issues and getting to know the repository.
And to cap it all off, thanks to P16 we have stickers now!
The skrub sticker on the back of a laptop
@pydataparis.bsky.social 2025 is over, and it was a big success!
Our talk was very well received, and we got a lot of great questions, especially about scalability and how to interface with other libraries in production environments.
What a banger skrub is, @skrub-data.bsky.social!
Big thumbs up for the sklearn team & the maintainer of this package