Link to Hugging Face paper page: https://huggingface.co/papers/2502.02737
The SmolLM2 paper is out! We wrote a paper detailing the steps we took to train one of the best smol LMs 🤗 out there: pre-training and post-training data, training ablations, and some interesting findings 💡
Go check it out and don't hesitate to write your thoughts/questions in the comments section!
distilabel ⚗️ reached 2k ⭐️ on GitHub!
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!
Follow along: github.com/huggingface/...
[Plot showing increased performance of Llama-3.2-3B when pretrained on FineMath]
Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!
Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
🤗 huggingface.co/datasets/Hug...
Here's a breakdown 🧵
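If you want to poke at the data, here's a minimal loading sketch with the 🤗 datasets library. The link above is truncated, so the repo id (HuggingFaceTB/finemath) and the config name (finemath-4plus) are my assumptions:

```python
from datasets import load_dataset

# Assumed repo id and config name (the link in the post is truncated)
ds = load_dataset(
    "HuggingFaceTB/finemath",
    "finemath-4plus",
    split="train",
    streaming=True,  # stream instead of downloading 50B+ tokens up front
)
print(next(iter(ds))["text"][:500])
```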
🎉 Argilla v2.6.0 is here! 🎉
Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub. 🤩
Take a look at this quick demo 👇
🙋‍♂️ More info about the release at github.com/argilla-io/a...
#AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla
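Under the hood, the export is a couple of lines with the Argilla Python SDK. A minimal sketch, assuming the Argilla 2.x client API; the server URL, API key, and dataset/repo names are placeholders:

```python
import argilla as rg

# Placeholders: point these at your own Argilla server
client = rg.Argilla(api_url="https://my-argilla-server.example", api_key="my-api-key")

# Fetch the annotated dataset from the Argilla server
dataset = client.datasets(name="my-annotated-dataset")

# Push the records and dataset settings to the Hugging Face Hub
dataset.to_hub(repo_id="my-org/my-annotated-dataset")
```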
That's 100% true. To be honest, all the regular expressions I've used in the last few months have been written by an LLM... Most of the time they work on the first try, but when they don't, it's a pain.
How many regular expressions have you written without the help of an LLM since ChatGPT appeared?
The FineWeb team is happy to finally release "FineWeb2" 🔥🥳
FineWeb2 extends the data-driven approach to pre-training dataset design that was introduced in FineWeb 1, now covering 1893 languages/scripts
Details: huggingface.co/datasets/Hug...
A detailed open-science tech report is coming soon
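While the tech report is on its way, you can already stream any language subset with 🤗 datasets. A minimal sketch; the repo id (HuggingFaceFW/fineweb-2) and the per-language config naming (e.g. "fra_Latn") are my assumptions, since the link above is truncated:

```python
from datasets import load_dataset

# Assumed repo id and config naming (language_Script), e.g. French in Latin script
ds = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True)
print(next(iter(ds))["text"][:300])
```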
For anyone interested in fine-tuning or aligning LLMs, I'm running this free and open course called smol course. It's not a big deal, it's just smol.
🧵>>
Is it just me, or is the latest Claude 3.5 Sonnet too prone to generating code when asked technical questions not directly related to coding?
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
SmolVLM can be fine-tuned in a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
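Running it locally is a few lines with transformers. A minimal inference sketch, assuming the model id HuggingFaceTB/SmolVLM-Instruct and the standard chat-template API; the image path is a placeholder:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Assumed model id
model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("document.png")  # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this document."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```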
As part of the SmolTalk release (the dataset mixture used for the @huggingface.bsky.social SmolLM2 model), we built a new version of the MagPie Ultra dataset using Llama 405B Instruct.
It contains 1M rows of multi-turn conversations with diverse instructions!
huggingface.co/datasets/arg...
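To browse a few conversations without downloading the full 1M rows, a quick streaming sketch; the repo id (argilla/magpie-ultra-v1.0) is my assumption, since the link above is truncated:

```python
from datasets import load_dataset

# Assumed repo id (the link in the post is truncated)
ds = load_dataset("argilla/magpie-ultra-v1.0", split="train", streaming=True)
print(next(iter(ds)))
```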
Let's make AI more inclusive.
At @huggingface.bsky.social we'll launch a huge community sprint soon to build high-quality training datasets for many languages.
We're looking for Language Leads to help with outreach.
Find your language and nominate yourself:
forms.gle/iAJVauUQ3FN8...
I am very excited to launch a new community initiative next week.
Let's build the largest open community dataset to evaluate and improve image generation models.
Follow:
huggingface.co/data-is-bett...
And stay tuned here
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
Thank you Marco!
We will soon release all the distilabel code used to generate the datasets. As a sneak peek, you can already check the code used for MagPie Ultra v1.0 here:
github.com/huggingface/...
The dataset allowed us to enhance the instruction following and reasoning of SmolLM2 with respect to the previous version. It also includes instructions for rewriting, summarization, and function calling.
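For a feel of what that code looks like, here's a minimal distilabel sketch of a MagPie-style pipeline. This assumes the distilabel 1.x API (MagpieGenerator, InferenceEndpointsLLM); the model id, parameters, and repo id are illustrative, not the exact MagPie Ultra v1.0 configuration:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MagpieGenerator

with Pipeline(name="magpie-ultra-sketch") as pipeline:
    # MagPie trick: prompt the instruct model with only its pre-query template
    # so it generates a user instruction, then generate the assistant reply.
    MagpieGenerator(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Llama-3.1-405B-Instruct-FP8",  # illustrative
            tokenizer_id="meta-llama/Llama-3.1-405B-Instruct-FP8",
            magpie_pre_query_template="llama3",
        ),
        n_turns=3,     # multi-turn conversations
        num_rows=100,  # scale this up for a full dataset
    )

if __name__ == "__main__":
    distiset = pipeline.run()
    distiset.push_to_hub("my-org/magpie-ultra-sketch")  # placeholder repo id
```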
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!
The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.
Check out the dataset:
huggingface.co/datasets/Hug...
Here's a notebook where I run SFT on SmolLM2 with the synthetic dataset: colab.research.google.com/drive/1lioed...
thanks @philschmid.bsky.social for the finetuning code
thanks @huggingface.bsky.social for the smol model
thanks @qgallouedec.bsky.social and friends for TRL
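If you'd rather not open the notebook, the core of it is a few lines of TRL. A minimal sketch, assuming the SFTTrainer API that accepts a model id string; the SmolTalk config name ("everyday-conversations") is my assumption for a small subset:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed config name for a small SmolTalk subset
dataset = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split="train")

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-1.7B",  # base (non-instruct) model
    train_dataset=dataset,               # "messages" column in chat format
    args=SFTConfig(output_dir="smollm2-sft", max_seq_length=2048),
)
trainer.train()
```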
The great exile!
For those who don't know me, I'm Gabriel, an ML Engineer at @huggingface.bsky.social, where I work on developing tools like distilabel and Argilla for you to take care of your data 🤗
My posts here will mainly be about synthetic data and LLM post-training.