My new book - on sale NEXT WEEK!
Sign up to get notified when it's available: dataschool.kit.com/mlbook
#MachineLearning #Python @scikit-learn.org
When I first started learning data science, I often got lost in the @scikit-learn.org documentation. After taking the course that became this book, I understood it much better and gained confidence using it.
Scikit-learn is a powerful library, but without guidance it can feel overwhelming at first.
Sorry, I think I found the answer in the "Pretraining cost" section.
Great work! I've been learning TFMs through TabICL and TabICLv2; your team's well-written papers and educational repo made things easy to follow.
Could you share how many hours and GPUs were used for data synthesis and pretraining? I want to learn about and do research on TFMs!
Dream unlocked: I'm publishing my first book!
It's called "Master Machine Learning with scikit-learn: A Practical Guide to Building Better Models with Python"
Download the first 3 chapters right now:
dataschool.kit.com/mlbook
Thanks for your support!
I got 3rd out of 691 in a tabular Kaggle competition, with only neural networks!
My solution is short (48 LOC) and relatively general-purpose: I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below.
Congratulations! I'm reading your paper after seeing some of RealMLP's success on Kaggle. It's not widespread yet, but it's quite impressive.
This work is presented at ICML next week.
• The paper: arxiv.org/html/2502.05...
• The Python package: pypistats.org/packages/tab... (try it out)
• The source code: github.com/soda-inria/t... (100% open source, including pre-training)
Longer read (5 min): gael-varoquaux.info/science/tabi...
8/9
#icml2025 Paper: TabICL, A Tabular Foundation Model for In-Context Learning on Large Data
With Jingang Qu, @dholzmueller.bsky.social, and Marine Le Morvan
TL;DR: a well-designed architecture and pretraining give the best tabular learner, and a more scalable one
On top of that, it's 100% open source
1/9
My thoughts on the current state of AI progress and the most important developments in 2025:
www.dataschool.io/ai-progress-...
I recently used TableVectorizer in a Kaggle tabular competition! It took quite a bit of effort to beat a strong baseline using TableVectorizer + HistGB 🤣
Are you familiar with Token Pooling?
Models that use late interaction, like ColBERT, ColPali, and ColQwen, gain significant benefits from this pooling technique! By integrating token pooling methods, the number of vectors to store can be reduced.
Blog: www.answer.ai/posts/colber...
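To make the idea concrete, here is a minimal sketch of one simple pooling variant: averaging consecutive groups of token vectors. The function name `pool_tokens` and the `pool_factor` parameter are made up for illustration, and grouping tokens by clustering their embeddings is another option; this is not necessarily the method from the blog post.

```python
import numpy as np


def pool_tokens(token_embeddings, pool_factor=2):
    """Average consecutive groups of `pool_factor` token vectors.

    token_embeddings: array of shape (n_tokens, dim), one vector per token.
    Returns an array of shape (ceil(n_tokens / pool_factor), dim).
    """
    n_tokens = token_embeddings.shape[0]
    pooled = []
    for start in range(0, n_tokens, pool_factor):
        group = token_embeddings[start:start + pool_factor]
        pooled.append(group.mean(axis=0))
    pooled = np.stack(pooled)
    # Re-normalise to unit length, assuming similarity is computed
    # as a dot product between unit vectors.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)
```

With `pool_factor=2`, a document of 7 token vectors is stored as 4 vectors, roughly halving the index size.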
Efficiently scale long CoT models like DeepSeek when using Best-of-N or Majority Voting by pruning reasoning chains early.
Kaggle Discussion: www.kaggle.com/competitions...
I find making your agents safe is just as important as making them smart.
A good read for building secure AI!
arxiv.org/pdf/2503.18813
@dataschool.io I hope that one day soon I can meet you in person to say thank you for your DS education. 🥰
There will be one day ... in 🇺🇸 or 🇻🇳
Claude finally integrated web search into its results...
But with LangChain & LangGraph, you can build a chatbot that integrates web search into ANY model you like!
You'll learn how to do that (and much more) in my new AI course...
Sign up for EARLY ACCESS:
dataschool.kit.com/agents
Excited to join the class in May 🥰
I guess you were quite busy last week?
A practical way for students to secure jobs and earn money is by developing real-world projects. Researching or engineering LLMs often seems like a field dominated by big tech!
It's still important to learn fundamentals from scratch for growth and problem-solving (e.g., being able to fix things)!
My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling: www.youtube.com/watch?v=Zar2...
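For readers new to the two sampling tricks mentioned above, here is a minimal pure-Python sketch, not the tutorial's code. The function name `sample_next_token` and its parameters are assumptions made for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw logits.

    top_k: keep only the k highest logits (ties at the cutoff are kept too).
    temperature: values below 1 sharpen the distribution, values above 1
    flatten it; a temperature of 0 is treated here as greedy decoding.
    """
    if top_k is not None:
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        # Masked logits get probability 0 after the softmax.
        logits = [x if x >= cutoff else float("-inf") for x in logits]
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `sample_next_token(logits, temperature=0.8, top_k=50)` in a generation loop would replace a plain argmax over the logits.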
cuDF-pandas (%load_ext cudf.pandas) with RAPIDS ... works similarly, and it's super cool to see that we will be able to speed up scikit-learn
Very cool to know! I will give it a try!!!
Scikit-learn accelerated
My company has a bunch of unused T4 GPUs because the LLMs are too big for the AI teams to run experiments on them. Now the data science team finally has a reason to ask for them! 🤣
developer.nvidia.com/blog/nvidia-...
A cool alternative to sharing a notebook!
In honor of March Madness, I've got a new blog post:
www.dataschool.io/pandas-strea...
Learn how to identify & analyze scoring streaks using pandas operations:
- shift()
- cumsum()
- boolean math
- groupby()
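A minimal sketch of how those four operations can fit together, using a made-up `games` table with hypothetical `player` and `made` columns. It is not the code from the blog post.

```python
import pandas as pd

# Toy data (invented): 1 = made shot, 0 = missed shot, in game order.
games = pd.DataFrame({
    "player": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "made":   [1, 1, 0, 1, 1, 0, 1, 1, 1],
})

# shift(): previous outcome for the same player (NaN on each first row).
prev = games.groupby("player")["made"].shift()

# Boolean math: a new streak starts whenever the outcome changes.
# NaN != value is True, so each player's first row starts a streak.
new_streak = games["made"] != prev

# cumsum(): running count of streak starts gives each streak an id.
games["streak_id"] = new_streak.cumsum()

# groupby(): position within each streak, zeroed out for misses.
games["streak_len"] = games.groupby("streak_id").cumcount() + 1
games["made_streak"] = games["streak_len"] * games["made"]

# Longest run of made shots per player.
longest = games.groupby("player")["made_streak"].max()
```

The same pattern works for any run-length question: flag the change points, take a cumulative sum to label the runs, then group by the label.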
Lots of good advice and best practices for missing value imputation in the paper!
I now have a much deeper appreciation for Data School's course and regard it as the best scikit-learn course.
Master Machine Learning with scikit-learn: courses.dataschool.io/master-machi...
"Some people today are discouraging others from learning programming on the grounds AI will automate it. This advice will be seen as some of the worst career advice ever given."
-- Andrew Ng, legendary AI researcher
Source: www.deeplearning.ai/the-batch/is...
A recent talk, fully in VS Code: 100% code on data wrangling for machine learning with @skrub-data.bsky.social
www.youtube.com/watch?v=hdWW...
Super powerful for assembling production-ready pipelines with simple syntax