With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured that it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!
Let's take a trip down memory lane!
[1/N]
16.12.2024 17:31
We will very soon announce a big community project, and are working on a blog post walking you through the entire dataset creation process. Stay tuned!
08.12.2024 09:19
HuggingFaceFW/fineweb-2 · Datasets at Hugging Face
We're on a journey to advance and democratize artificial intelligence through open source and open science.
The dataset is released under the permissive ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
Find out all about 🥂 FineWeb2 on the 🤗 model page:
huggingface.co/datasets/Hug...
08.12.2024 09:19
Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8 TB of compressed text data and outperforms other datasets.
08.12.2024 09:19