Start synthesizing: huggingface.co/spaces/argil...
Blog post: huggingface.co/blog/sdiazlo...
Generate RAG data with the Synthetic Data Generator to improve your RAG system!
1️⃣ Generate from your documents, dataset, or dataset description.
2️⃣ Configure it.
3️⃣ Generate the synthetic dataset.
4️⃣ Fine-tune the retrieval and reranking models.
5️⃣ Build a RAG pipeline.
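Step 4 above can be sketched in plain Python: turn the generated question–context records into (anchor, positive, negative) triplets, the shape retrieval fine-tuning typically expects. The field names `question` and `context` are illustrative assumptions, not the generator's actual schema.

```python
import random

def build_triplets(records, seed=0):
    """Turn synthetic (question, context) records into (anchor, positive,
    negative) triplets for retrieval fine-tuning. The negative is a context
    sampled from a *different* record."""
    rng = random.Random(seed)
    triplets = []
    for i, rec in enumerate(records):
        # Pick any record other than the current one to supply the negative.
        j = rng.choice([k for k in range(len(records)) if k != i])
        triplets.append({
            "anchor": rec["question"],          # the query
            "positive": rec["context"],         # the passage that answers it
            "negative": records[j]["context"],  # an unrelated passage
        })
    return triplets

records = [
    {"question": "What is RAG?", "context": "RAG combines retrieval with generation."},
    {"question": "What is a reranker?", "context": "A reranker reorders retrieved passages."},
]
triplets = build_triplets(records)
```

Triplets in this shape can then be fed to a triplet-loss trainer (e.g. a sentence-transformers-style setup) for the retrieval and reranking models.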
Argilla v2.6.0 is here!
Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub.
Take a look at this quick demo!
More info about the release at github.com/argilla-io/a...
#AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla
No-code end-to-end example to train your model
1️⃣ Use the Synthetic Data Generator to create your custom dataset
2️⃣ Use AutoTrain to train your model on the generated dataset
Check it here: huggingface.co/blog/synthet...
- No code required: everything can be handled through the interface.
- 100% free to use.
- Designed to create text classification and chat datasets.
- Review in Argilla and push to the Hub.
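For the text-classification case above, a minimal sketch of what a generated record might look like, and a sanity check before review: filter records whose label falls outside the schema. The `text`/`label` field names and the label set are illustrative assumptions, not the generator's actual output format.

```python
# Hypothetical label schema for a generated text-classification dataset.
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def validate(records, allowed=ALLOWED_LABELS):
    """Split records into those matching the label schema and those that
    would need fixing during the Argilla review step."""
    good, bad = [], []
    for rec in records:
        (good if rec.get("label") in allowed else bad).append(rec)
    return good, bad

records = [
    {"text": "Great release!", "label": "positive"},
    {"text": "Meh.", "label": "unknown"},  # off-schema: flagged for review
]
good, bad = validate(records)
```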
Where do I get quality data from? We often need to fine-tune models for very specific scenarios. And that's where the Synthetic Data Generator comes in!
Want to see how it works? Watch this quick video (www.youtube.com/watch?v=nXjV...) and get started here: t.co/hJ1b2TsMq0
Little by little, we're making progress! I encourage you to contribute: just open the link, read the instructions, and start annotating.
data-is-better-together-fineweb-c.hf.space/share-your-p...
It only takes 2 steps:
- Coordinate with your Language Lead: huggingface.co/spaces/Huggi.... Or become one if it is missing: huggingface.co/spaces/natal...
- Read the guidelines and start annotating according to the educational value: huggingface.co/spaces/data-...
Spanish, Filipino, Amharic, French, German, Basque, Catalan, Galician, Guarani, Telugu, Italian, Pashto, Romanian, Tamil, Urdu, Danish... and many more! All included in the FineWeb2 Community Annotation Sprint!
Join us to build an impactful dataset for your language!
Binarized dataset: huggingface.co/datasets/dat...
Blog post: huggingface.co/blog/image-p...
Open Image Preferences released!
- Open-source dataset for text2image
- 10K samples manually evaluated by the HF community.
- Binarized format for SFT, DPO, or ORPO.
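The "binarized" format above can be sketched as follows: each annotated sample with two candidate images and a human choice collapses into the (chosen, rejected) pair that DPO- and ORPO-style trainers expect. The field names are illustrative, not the dataset's actual schema.

```python
def binarize(sample):
    """Collapse a two-candidate preference annotation into the
    chosen/rejected pair used by DPO- and ORPO-style trainers."""
    a, b = sample["candidates"]  # two generations for the same prompt
    chosen, rejected = (a, b) if sample["preferred"] == 0 else (b, a)
    return {"prompt": sample["prompt"], "chosen": chosen, "rejected": rejected}

sample = {
    "prompt": "a watercolor fox in a misty forest",
    "candidates": ["image_url_a", "image_url_b"],
    "preferred": 1,  # the annotator picked the second candidate
}
pair = binarize(sample)
```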
It comes with a nice blog post explaining the steps to pre-process and generate the data, along with the results.
I'd say more small models, and a bigger focus on agents and on-device AI.
This is crazy! Were you right in your predictions?
Language is power! A multilingual annotation sprint for hundreds of languages is starting soon! Step up as a Language Lead and help drive this effort for your language.
If there's already a Language Lead, stay tuned! Is this the start of a nice community?
docs.google.com/forms/d/e/1F...
Lovely!
To end the week on a high note, my furry friend ⭐
Want to improve your model quality? Implement the data annotation stage in your MLOps effortlessly thanks to the enhanced integration of Argilla with ZenML.
✨ Use the latest Argilla features
✨ Improve human-in-the-loop workflows
✨ Manage datasets, track progress, and coordinate your annotation team
Model: huggingface.co/Qwen/QwQ-32B...
Demo: huggingface.co/spaces/Qwen/...
QwQ-32B-Preview is available on the Hub!
> The results are very promising, beating o1-mini.
> However, they also have several limitations you might notice even in the demo (I found endless reasoning while trying to count the number of 'r's in 'strawberry'). So, let's see how they deal with them.
I do. Big AI companies stealing our data have put us on guard, but good intentions also exist. So, let's learn together from this and find ways to continue building with consent and transparency for everyone, not just those in power.
It's pretty sad to see the negative sentiment towards Hugging Face on this platform due to a dataset published by one of its employees. I want to write a small piece. 🧵
Hugging Face empowers everyone to use AI to create value and is against the monopolization of AI; it's a hosting platform above all.
Hugging Face inference endpoints now support CPU deployment for llama.cpp
Why is this a huge deal? Llama.cpp is well-known for running very well on CPU. If you're running small models like Llama 1B or embedding models, this will definitely save tons of money.
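A minimal sketch of calling such a deployment, assuming the endpoint exposes llama.cpp's OpenAI-compatible `/v1/chat/completions` route; the endpoint URL, token, and model name below are placeholders, not real values.

```python
import json
import urllib.request

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder token

payload = {
    "model": "llama-1b",  # illustrative model name
    "messages": [{"role": "user", "content": "Summarize llama.cpp in one line."}],
    "max_tokens": 64,
}

# Build the request; this does not hit the network until urlopen is called.
req = urllib.request.Request(
    ENDPOINT_URL + "/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```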
Steps:
1️⃣ Log in to the Argilla Space with your HF account: huggingface.co/spaces/data-...
2️⃣ Check the guidelines.
3️⃣ Time to start annotating!
Can you climb to the top of the leaderboard? huggingface.co/spaces/data-...
Help build an image preference dataset!
> Goal: Release an open-source image dataset, enabling the entire community to benefit from it.
> Requirements: All you need is a Hugging Face account and a willingness to contribute.
More in 🧵
Let's make AI more inclusive.
At @huggingface.bsky.social we'll launch a huge community sprint soon to build high-quality training datasets for many languages.
We're looking for Language Leads to help with outreach.
Find your language and nominate yourself:
forms.gle/iAJVauUQ3FN8...
"Naftali was assigned to train AI to recognize and weed out pornography, hate speech and excessive violence, which meant sifting through the worst of the worst content online for hours on end."
So much of AI is based on exploiting workers in precarious conditions.
www.cbsnews.com/news/labeler...
Reach out to us at:
- GitHub: github.com/argilla-io/a...
- Discord ( #argilla-distilabel-general or #argilla-distilabel-help): hf.co/join/discord