Guilherme Penedo

@guilherme.hf.co

ML Research Engineer at 🤗. Lisboeta 🇵🇹

612 Followers · 66 Following · 3 Posts · Joined 13.11.2024

Latest posts by Guilherme Penedo @guilherme.hf.co

🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured that it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!

Let's take a trip down memory lane!

[1/N]

16.12.2024 17:31 👍 33 🔁 10 💬 4 📌 4

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

08.12.2024 09:19 👍 6 🔁 0 💬 1 📌 0
HuggingFaceFW/fineweb-2 · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

Find out all about 🥂 FineWeb2 on the 🤗 model page:
huggingface.co/datasets/Hug...

08.12.2024 09:19 👍 4 🔁 0 💬 1 📌 0

Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.

08.12.2024 09:19 👍 76 🔁 19 💬 1 📌 0