Duc Nguyen Huu

@ducnh279

Data Science in ♥️ Home in 🇻🇳 Kaggle Competitions Master 🥇 1 Solo Gold 🥈 2 Silvers (1 Solo, 1 Team) 🌍 Ranked 272 / 202K globally (Top 0.14%)

44 Followers
19 Following
57 Posts
Joined 24.11.2024

Latest posts by Duc Nguyen Huu @ducnh279

My new book - on sale NEXT WEEK! 🎉

Sign up to get notified when it's available: dataschool.kit.com/mlbook

#MachineLearning #Python @scikit-learn.org

27.02.2026 16:15 👍 3 🔁 1 💬 0 📌 0

When I first started learning data science, I often got lost in the @scikit-learn.org documentation. After taking the course that became this book, I understood it much better and gained confidence using it.

Scikit-learn is a powerful library, but without guidance, it can feel overwhelming at first.

20.02.2026 18:32 👍 1 🔁 0 💬 1 📌 0

Sry, I think I found the answer in the "Pretraining cost" section 😅

19.02.2026 09:57 👍 1 🔁 0 💬 0 📌 0

Great work! I’ve been learning TFMs through TabICL and TabICLv2. Your team's well-written papers and educational repo made things easy to follow.

Could you share how many hours and GPUs were used for data synthesis and pretraining? I'm asking because I want to learn about and do research on TFMs!

19.02.2026 09:07 👍 2 🔁 0 💬 1 📌 0

Dream unlocked: I'm publishing my first book! 🎉🎉🎉

It's called "Master Machine Learning with scikit-learn: A Practical Guide to Building Better Models with Python"

Download the first 3 chapters right now:
👉 dataschool.kit.com/mlbook 👈

Thanks for your support 🙏

11.09.2025 17:53 👍 26 🔁 6 💬 1 📌 0

I got 3rd out of 691 in a tabular Kaggle competition – with only neural networks! 🥉

My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below 👇

29.07.2025 11:10 👍 11 🔁 2 💬 2 📌 0

Congratulations! I'm reading your paper after seeing some of RealMLP's success on Kaggle. It’s not widespread yet, but it's quite impressive.

30.07.2025 12:30 👍 1 🔁 0 💬 1 📌 0
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

This work is presented at ICML next week.
• The paper: arxiv.org/html/2502.05...
• The Python package: pypistats.org/packages/tab... (try it out 🐍)
• The source code: github.com/soda-inria/t... (100% open source, including pre-training 💞)

Longer read (5 min): gael-varoquaux.info/science/tabi...
8/9

09.07.2025 18:41 👍 12 🔁 2 💬 0 📌 0

👨‍🎓🧾✨ #icml2025 Paper: TabICL, A Tabular Foundation Model for In-Context Learning on Large Data
With Jingang Qu, @dholzmueller.bsky.social, and Marine Le Morvan

TL;DR: a well-designed architecture and pretraining give the best tabular learner, and a more scalable one
On top of that, it's 100% open source
1/9

09.07.2025 18:41 👍 50 🔁 15 💬 1 📌 0
AI progress in 2025 📈 Thoughts on the current state of AI progress and the most important developments in 2025

My thoughts on the current state of AI progress and the most important developments in 2025:

www.dataschool.io/ai-progress-...

28.05.2025 14:17 👍 1 🔁 1 💬 0 📌 0

I recently used TableVectorizer in a Kaggle tabular competition! It took quite a bit of effort to beat a strong baseline using TabVec + HistGB 🤣

29.05.2025 21:47 👍 1 🔁 0 💬 0 📌 0
20.04.2025 12:25 👍 0 🔁 0 💬 0 📌 0
A little pooling goes a long way for multi-vector representations – Answer.AI Practical AI R&D

Are you familiar with Token Pooling?

Models that use late interaction, like ColBERT, ColPali, and ColQwen, gain significant benefits from this pooling technique! By integrating token pooling methods, the number of vectors to store can be reduced.

Blog: www.answer.ai/posts/colber...

04.04.2025 23:41 👍 0 🔁 0 💬 0 📌 0
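
To make the token pooling idea from the post above concrete, here is a minimal pure-Python sketch. It assumes the simplest possible grouping, averaging every `pool_factor` consecutive token vectors; the linked blog may use a different strategy (for example, grouping similar tokens by clustering). The function name `mean_pool_tokens` and the toy vectors are made up for illustration.

```python
from typing import List

Vector = List[float]


def mean_pool_tokens(token_vectors: List[Vector], pool_factor: int) -> List[Vector]:
    """Average each run of `pool_factor` consecutive token vectors into one vector."""
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    pooled = []
    for start in range(0, len(token_vectors), pool_factor):
        group = token_vectors[start:start + pool_factor]
        dim = len(group[0])
        # Component-wise mean over the vectors in this group.
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


# Hypothetical document: 6 tokens with 2-dimensional embeddings.
doc_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0], [7.0, 7.0]]
pooled_vectors = mean_pool_tokens(doc_vectors, pool_factor=2)
print(len(doc_vectors), "vectors ->", len(pooled_vectors), "vectors")
```

With `pool_factor=2` the index stores half as many vectors per document; whether retrieval quality holds up at a given factor has to be measured on your own data.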
AI Mathematical Olympiad - Progress Prize 2 Solve national-level math challenges using artificial intelligence models

Efficiently scale long CoT models like DeepSeek when using Best-of-N or Majority Voting by early pruning reasoning chains.

Kaggle Discussion: www.kaggle.com/competitions...

04.04.2025 19:48 👍 0 🔁 0 💬 0 📌 0
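
One way to picture early pruning with majority voting is sketched below. The pruning rule is an assumption, since the post does not spell out the criterion used in the Kaggle discussion: answers are consumed as chains finish, and the remaining chains are dropped as soon as the leading answer can no longer be overtaken. The function name and the sample answers are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def majority_vote_with_early_stop(
    answers: Iterable[str], n_chains: int
) -> Tuple[Optional[str], int]:
    """Return (winning answer, number of chains actually consumed)."""
    counts: Counter = Counter()
    used = 0
    for answer in answers:
        used += 1
        counts[answer] += 1
        remaining = n_chains - used
        ranked = counts.most_common(2)
        leader, leader_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        # Even if every remaining chain voted for the runner-up,
        # the leader would still win, so the rest can be pruned.
        if leader_votes > runner_up_votes + remaining:
            return leader, used
    if not counts:
        return None, 0
    return counts.most_common(1)[0][0], used


# Hypothetical final answers from 8 sampled reasoning chains, in finishing order.
finished_answers = iter(["42", "42", "17", "42", "42", "42", "17", "42"])
winner, chains_used = majority_vote_with_early_stop(finished_answers, n_chains=8)
print(winner, chains_used)
```

In this toy run the vote is decided after 6 of the 8 chains, so the last two long generations would never need to finish.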

I find making your agents safe is just as important as making them smart. 🔒

A good read for building secure AI!

arxiv.org/pdf/2503.18813

31.03.2025 12:47 👍 1 🔁 0 💬 0 📌 0

@dataschool.io I hope that one day soon, I can meet you in person to say thank you for your DS education. 🥰

30.03.2025 19:45 👍 1 🔁 0 💬 0 📌 0

There will be one day ... in 🇺🇸 or 🇻🇳

30.03.2025 19:37 👍 1 🔁 0 💬 2 📌 0

Claude finally integrated web search into its results...

But with LangChain & LangGraph, you can build a chatbot that integrates web search into ANY model you like!

You'll learn how to do that (and much more) in my new AI course...

Sign up for EARLY ACCESS:
👉 dataschool.kit.com/agents 👈

27.03.2025 11:58 👍 2 🔁 2 💬 0 📌 0

Excited to join the class in May 🥰

24.03.2025 16:33 👍 1 🔁 0 💬 0 📌 0

I guess you were quite busy last week? 😅

24.03.2025 16:18 👍 1 🔁 0 💬 1 📌 0

A practical way for students to secure jobs and earn money is by developing real-world projects. Researching or engineering LLMs often seems like a field dominated by big tech!

It's still important to learn fundamentals from scratch for growth and problem-solving (e.g., being able to fix things)! 😁

24.03.2025 16:17 👍 2 🔁 0 💬 0 📌 0

My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling: www.youtube.com/watch?v=Zar2...

23.03.2025 13:38 👍 61 🔁 12 💬 0 📌 0
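
The two sampling tweaks named in the post above can be sketched in a few lines of plain Python. This is not the tutorial's code, only an illustration under simple assumptions: `logits` is a list of raw next-token scores, `top_k=0` means "no filtering", and the function name `sample_next_token` is made up.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: int = 0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index using temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Top-k: keep only the indices of the k highest logits (all of them if top_k == 0).
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        indices = indices[:top_k]
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logits[i] / temperature for i in indices]
    # Softmax numerators, shifted by the max for numerical stability.
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]


# Hypothetical logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

With `top_k=2` only token indices 0 and 1 can ever be drawn, and with `top_k=1` sampling reduces to greedy decoding.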

cuDF-pandas (%load_ext cudf.pandas) with RAPIDS ... works similarly, and it's super cool to see that we will be able to speed up scikit-learn

20.03.2025 09:59 👍 1 🔁 1 💬 1 📌 0

Very cool to know 🐼 I will give it a try!!! 💨

20.03.2025 10:54 👍 1 🔁 0 💬 0 📌 0
NVIDIA cuML Brings Zero Code Change Acceleration to scikit-learn | NVIDIA Technical Blog Scikit-learn, the most widely used ML library, is popular for processing tabular data because of its simple API, diversity of algorithms, and compatibility with popular Python libraries such as pandas...

Scikit-learn accelerated 🚀

My company has a bunch of unused T4 GPUs because the LLMs are too big for the AI teams to run experiments on them. Now the data science team finally has a reason to ask for them! 🤣

developer.nvidia.com/blog/nvidia-...

20.03.2025 07:57 👍 2 🔁 0 💬 1 📌 0

A cool alternative to sharing a notebook! 😂

18.03.2025 18:56 👍 1 🔁 0 💬 1 📌 0
How to calculate "scoring streaks" with pandas 🏀 Learn how to identify & analyze consecutive events in your data using advanced DataFrame methods!

In honor of March Madness 🏀, I've got a new blog post:

www.dataschool.io/pandas-strea...

Learn how to identify & analyze scoring streaks using pandas operations:

- shift()
- cumsum()
- boolean math
- groupby()

17.03.2025 13:53 👍 1 🔁 1 💬 0 📌 0
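
As a rough illustration of how the four operations listed in the post above can combine, here is a small pandas sketch. It is not taken from the blog post; the `shots` DataFrame, its `made` column, and the derived column names are invented for the example.

```python
import pandas as pd

# Hypothetical shot log: 1 = made shot, 0 = missed shot.
shots = pd.DataFrame({"made": [1, 1, 0, 1, 1, 1, 0, 0, 1]})

# shift() lines each row up with the previous one; the comparison is True
# wherever the value changes, i.e. wherever a new run of makes or misses starts.
starts_new_run = shots["made"] != shots["made"].shift()

# Boolean math: cumsum() treats True as 1, so each run gets its own id.
shots["run_id"] = starts_new_run.cumsum()

# groupby() on the run id numbers the rows inside each run;
# multiplying by `made` sets the count to 0 on missed shots.
shots["streak"] = (shots.groupby("run_id").cumcount() + 1) * shots["made"]

longest_streak = int(shots["streak"].max())
print(shots)
print("Longest scoring streak:", longest_streak)
```

The `streak` column holds the running length of the current scoring streak at every row, so its maximum is the longest streak in the log.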

Lots of good advice and best practices for missing value imputation in the paper!

I now have a much deeper appreciation for Data School's course and regard it as the best scikit-learn course.

Master Machine Learning with scikit-learn: courses.dataschool.io/master-machi...

18.03.2025 15:55 👍 2 🔁 1 💬 1 📌 0
DeepSeek-R1 Uncensored, QwQ-32B Puts Reasoning in Smaller Model, and more... The Batch AI News and Insights: Some people today are discouraging others from learning programming on the grounds AI will automate it.

"Some people today are discouraging others from learning programming on the grounds AI will automate it. This advice will be seen as some of the worst career advice ever given."

-- Andrew Ng, legendary AI researcher

Source: www.deeplearning.ai/the-batch/is...

13.03.2025 18:05 👍 1 🔁 1 💬 0 📌 0
The Future of AI & Machine Learning | The Python Exchange February 2025 YouTube video by Don't Use This Code • James Powell

A recent talk, done entirely in VS Code: 100% code on data wrangling for machine learning with @skrub-data.bsky.social
www.youtube.com/watch?v=hdWW...

Super powerful for easily assembling production-ready pipelines with simple syntax

14.03.2025 15:15 👍 23 🔁 5 💬 2 📌 0