Duc Nguyen Huu (@ducnh279)

My new book - on sale NEXT WEEK! 🎉

Sign up to get notified when it's available: dataschool.kit.com/mlbook

#MachineLearning #Python @scikit-learn.org

27.02.2026 16:15 👍 3 🔁 1 💬 0 📌 0

When I first started learning data science, I often got lost in the @scikit-learn.org documentation. After taking the course that became this book, I understood it much better and gained confidence using it.

Scikit-learn is a powerful library, but without guidance, it can feel overwhelming at first.

20.02.2026 18:32 👍 1 🔁 0 💬 1 📌 0

Sorry, I think I found the answer in the "Pretraining cost" section 😅

19.02.2026 09:57 👍 1 🔁 0 💬 0 📌 0

Great work! I’ve been learning about TFMs through TabICL and TabICLv2; your team's well-written papers and educational repo made things easy to follow.

Could you share how many hours and GPUs were used for data synthesis and pretraining? I'd like to learn more about TFMs and do research on them!

19.02.2026 09:07 👍 2 🔁 0 💬 1 📌 0

Dream unlocked: I'm publishing my first book! 🎉🎉🎉

It's called "Master Machine Learning with scikit-learn: A Practical Guide to Building Better Models with Python"

Download the first 3 chapters right now:
👉 dataschool.kit.com/mlbook 👈

Thanks for your support 🙏

11.09.2025 17:53 👍 26 🔁 6 💬 1 📌 0

I got 3rd out of 691 in a tabular kaggle competition – with only neural networks! 🥉

My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below👇

29.07.2025 11:10 👍 11 🔁 2 💬 2 📌 0

Congratulations! I'm reading your paper after seeing some of RealMLP's success on Kaggle. It’s not widespread yet, but it's quite impressive.

30.07.2025 12:30 👍 1 🔁 0 💬 1 📌 0

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

This work is presented at ICML next week.
• The paper arxiv.org/html/2502.05...
• The python package: pypistats.org/packages/tab... (try it out 🐍)
• The source code github.com/soda-inria/t... (100% open source, including pre-training 💞)

Longer read (5mn): gael-varoquaux.info/science/tabi...
8/9

09.07.2025 18:41 👍 12 🔁 2 💬 0 📌 0

👨‍🎓🧾✨#icml2025 Paper: TabICL, A Tabular Foundation Model for In-Context Learning on Large Data
With Jingang Qu, @dholzmueller.bsky.social, and Marine Le Morvan

TL;DR: a well-designed architecture and pretraining give the best tabular learner, and a more scalable one
On top of that, it's 100% open source
1/9

09.07.2025 18:41 👍 50 🔁 15 💬 1 📌 0

AI progress in 2025 📈 Thoughts on the current state of AI progress and the most important developments in 2025

My thoughts on the current state of AI progress and the most important developments in 2025:

www.dataschool.io/ai-progress-...

28.05.2025 14:17 👍 1 🔁 1 💬 0 📌 0

I recently used TableVectorizer in a Kaggle tabular competition! It took quite a bit of effort to beat a strong baseline using TabVec + HistGB 🤣

29.05.2025 21:47 👍 1 🔁 0 💬 0 📌 0

20.04.2025 12:25 👍 0 🔁 0 💬 0 📌 0

A little pooling goes a long way for multi-vector representations – Answer.AI Practical AI R&D

Are you familiar with Token Pooling?

Models that use late interaction, like ColBERT, ColPali, and ColQwen, gain significant benefits from this pooling technique! By integrating token pooling methods, the number of vectors to store can be reduced.

Blog: www.answer.ai/posts/colber...
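
To make the idea concrete, here is a minimal sketch of token pooling in plain Python. It assumes the simplest possible variant, where every run of `pool_factor` consecutive token vectors is averaged into one vector; grouping tokens by similarity (for example with clustering) is another option, and the linked blog may do it differently. The function name `mean_pool_tokens` and the `pool_factor` parameter are hypothetical, chosen only for illustration.

```python
from typing import List

Vector = List[float]


def mean_pool_tokens(token_vectors: List[Vector], pool_factor: int) -> List[Vector]:
    """Average each run of `pool_factor` consecutive token vectors into one vector.

    Assumption: sequential pooling, i.e. tokens are grouped by position,
    not by similarity. The last group may be smaller than `pool_factor`.
    """
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    pooled: List[Vector] = []
    for start in range(0, len(token_vectors), pool_factor):
        group = token_vectors[start:start + pool_factor]
        dim = len(group[0])
        # Component-wise mean of the vectors in this group.
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


# Toy document with 5 token vectors of dimension 2 (made-up numbers).
doc = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [5.0, 5.0]]
pooled_doc = mean_pool_tokens(doc, pool_factor=2)
print(len(doc), "->", len(pooled_doc), "vectors to store")
```

With a pool factor of 2, the index stores roughly half as many vectors per document, which is where the storage saving comes from.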

04.04.2025 23:41 👍 0 🔁 0 💬 0 📌 0

AI Mathematical Olympiad - Progress Prize 2 Solve national-level math challenges using artificial intelligence models

Efficiently scale long CoT models like DeepSeek when using Best-of-N or Majority Voting by early pruning reasoning chains.

Kaggle Discussion: www.kaggle.com/competitions...
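
One way to picture the saving is a majority vote that stops as soon as the result can no longer change. The sketch below is an illustration under that assumption, not the method from the linked Kaggle discussion: it consumes final answers one by one and stops once the leading answer cannot be overtaken by the chains that are still outstanding. The function name `majority_vote_early_stop` and the toy answers are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def majority_vote_early_stop(answers: Iterable[str], n_total: int) -> Tuple[Optional[str], int]:
    """Return (winning answer, number of chains consumed).

    Assumption: `answers` yields the final answer of each reasoning chain
    as it finishes, and `n_total` is the planned number of chains.
    """
    counts: Counter = Counter()
    used = 0
    for answer in answers:
        counts[answer] += 1
        used += 1
        remaining = n_total - used
        ranked = counts.most_common(2)
        leader, lead_count = ranked[0]
        runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
        # Even if every remaining chain voted for the runner-up (or for a
        # brand-new answer), the leader would still win: stop early.
        if lead_count > runner_up_count + remaining:
            return leader, used
    if not counts:
        return None, 0
    return counts.most_common(1)[0][0], used


# Toy run: 8 planned chains, answers arriving in this order (made-up data).
chain_answers = ["42", "42", "42", "7", "42", "42", "7", "13"]
winner, chains_used = majority_vote_early_stop(iter(chain_answers), n_total=8)
print(winner, chains_used)
```

In a real setup the chains usually run in parallel, so "stopping" would mean cancelling the chains that are still generating rather than skipping items in a list.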

04.04.2025 19:48 👍 0 🔁 0 💬 0 📌 0

I find making your agents safe is just as important as making them smart. 🔒

A good read for building secure AI!

arxiv.org/pdf/2503.18813

31.03.2025 12:47 👍 1 🔁 0 💬 0 📌 0

@dataschool.io I hope that one day soon, I can meet you in person to say thank you for your DS education. 🥰

30.03.2025 19:45 👍 1 🔁 0 💬 0 📌 0

There will be one day ... in 🇺🇸 or 🇻🇳

30.03.2025 19:37 👍 1 🔁 0 💬 2 📌 0

Claude finally integrated web search into its results...

But with LangChain & LangGraph, you can build a chatbot that integrates web search into ANY model you like!

You'll learn how to do that (and much more) in my new AI course...

Sign up for EARLY ACCESS:
👉 dataschool.kit.com/agents 👈

27.03.2025 11:58 👍 2 🔁 2 💬 0 📌 0

Excited to join the class in May 🥰

24.03.2025 16:33 👍 1 🔁 0 💬 0 📌 0

I guess you were quite busy last week? 😅

24.03.2025 16:18 👍 1 🔁 0 💬 1 📌 0

A practical way for students to secure jobs and earn money is by developing real-world projects. Researching or engineering LLMs often seems like a field dominated by big tech!

It's still important to learn fundamentals from scratch for growth and problem-solving (e.g., being able to fix things)! 😁

24.03.2025 16:17 👍 2 🔁 0 💬 0 📌 0

My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling: www.youtube.com/watch?v=Zar2...
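
For readers who have not seen these two decoding tricks, here is a minimal standard-library sketch of how they can be combined. It is not the code from the tutorial: the function name `sample_next_token` and the toy logits are assumptions made for illustration. Top-k keeps only the `top_k` highest logits, temperature divides the logits before the softmax (values below 1 sharpen the distribution, values above 1 flatten it), and the next token is drawn from the result.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from `logits` with top-k filtering and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    rng = rng or random.Random()

    # Top-k: keep only the indices of the k largest logits.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]

    # Temperature scaling followed by a numerically stable softmax
    # (unnormalized weights are enough for random.choices).
    scaled = [logits[i] / temperature for i in indices]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]

    return rng.choices(indices, weights=weights, k=1)[0]


# Toy logits over a 4-token vocabulary (made-up numbers).
toy_logits = [0.1, 2.5, -1.0, 2.0]
greedy_like = sample_next_token(toy_logits, top_k=1)
sampled = sample_next_token(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0))
print(greedy_like, sampled)
```

With `top_k=1` the function always returns the highest-scoring token, which is a handy sanity check; with `top_k=2` only the two best tokens can ever be drawn.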

23.03.2025 13:38 👍 61 🔁 12 💬 0 📌 0

cuDF-pandas (%load_ext cudf.pandas) with RAPIDS ... works similarly, and it's super cool to see that we will be able to speed up scikit-learn

20.03.2025 09:59 👍 1 🔁 1 💬 1 📌 0

Very cool to know 🐼 I will give it a try!!! 💨

20.03.2025 10:54 👍 1 🔁 0 💬 0 📌 0

NVIDIA cuML Brings Zero Code Change Acceleration to scikit-learn | NVIDIA Technical Blog Scikit-learn, the most widely used ML library, is popular for processing tabular data because of its simple API, diversity of algorithms, and compatibility with popular Python libraries such as pandas...

Scikit-learn accelerated 🚀

My company has a bunch of unused T4 GPUs because the LLMs are too big for the AI teams to run experiments on them. Now the data science team finally has a reason to ask for them! 🤣

developer.nvidia.com/blog/nvidia-...

20.03.2025 07:57 👍 2 🔁 0 💬 1 📌 0

A cool alternative to sharing a notebook! 😂

18.03.2025 18:56 👍 1 🔁 0 💬 1 📌 0

How to calculate "scoring streaks" with pandas 🏀 Learn how to identify & analyze consecutive events in your data using advanced DataFrame methods!

In honor of March Madness 🏀, I've got a new blog post:

www.dataschool.io/pandas-strea...

Learn how to identify & analyze scoring streaks using pandas operations:

- shift()
- cumsum()
- boolean math
- groupby()
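
The four operations listed above can be combined roughly as follows. This is a sketch on made-up data, not the code from the blog post: the column names `player`, `made`, `streak_id`, and `streak_len` are assumptions. A new streak starts whenever a value differs from the previous one for the same player, a cumulative sum of those "changed" flags labels each streak, and a per-streak counter gives its running length.

```python
import pandas as pd

# Made-up shot log: 1 = made, 0 = missed, in chronological order per player.
shots = pd.DataFrame({
    "player": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "made":   [1,   1,   0,   1,   1,   1,   0,   1,   1,   0],
})

# shift(): previous result for the same player (NaN on each player's first row).
previous = shots.groupby("player")["made"].shift()

# Boolean math: True wherever a new streak begins.
changed = (shots["made"] != previous).astype(int)

# cumsum(): running count of streak starts gives each streak an id per player.
shots["streak_id"] = changed.groupby(shots["player"]).cumsum()

# groupby(): running length of the current streak.
shots["streak_len"] = shots.groupby(["player", "streak_id"]).cumcount() + 1

# Longest streak of made shots per player.
longest = shots[shots["made"] == 1].groupby("player")["streak_len"].max()
print(longest)
```

Filtering on `made == 1` at the end keeps only scoring streaks; dropping that filter would report the longest run of any kind, misses included.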

17.03.2025 13:53 👍 1 🔁 1 💬 0 📌 0

Lots of good advice and best practices for missing value imputation in the paper!

I now have a much deeper appreciation for Data School's course and regard it as the best scikit-learn course.

Master Machine Learning with scikit-learn: courses.dataschool.io/master-machi...

18.03.2025 15:55 👍 2 🔁 1 💬 1 📌 0

DeepSeek-R1 Uncensored, QwQ-32B Puts Reasoning in Smaller Model, and more... The Batch AI News and Insights: Some people today are discouraging others from learning programming on the grounds AI will automate it.

"Some people today are discouraging others from learning programming on the grounds AI will automate it. This advice will be seen as some of the worst career advice ever given."

-- Andrew Ng, legendary AI researcher

Source: www.deeplearning.ai/the-batch/is...

13.03.2025 18:05 👍 1 🔁 1 💬 0 📌 0

The Future of AI & Machine Learning | The Python Exchange February 2025 YouTube video by Don't Use This Code • James Powell

A recent talk, done entirely in VS Code: 100% code on data wrangling for machine learning with @skrub-data.bsky.social
www.youtube.com/watch?v=hdWW...

Super powerful for easily assembling production-ready pipelines with a simple syntax

14.03.2025 15:15 👍 23 🔁 5 💬 2 📌 0
