Let's build a more robust foundation for LLM evaluation!
A collaboration from @hebrewuniversity.bsky.social @nlphuji.bsky.social @IBMResearch and more:
@yperlitz.bsky.social @lchoshen.bsky.social @gabistanovsky.bsky.social
Care about LLM evaluation?
We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
on different prompts, domains, tokens, models...
Join our community effort to expand it with YOUR model predictions & become a co-author!
Key findings from 🕊️ DOVE:
1. Prompt sensitivity is HUGE! Performance varies dramatically with small changes (e.g., OLMo's accuracy on HellaSwag ranges from 1% to 99% simply by changing prompt elements like phrasing, enumerators, and answer order); see the first sketch below.
2. Selecting prompt characteristics (e.g., phrasing, enumerators) based on past examples helps efficiently find optimal prompts; see the second sketch below.
3. Some instances are consistently easy or hard across ALL prompts: no matter how you prompt, models either always succeed or always fail; see the third sketch below.
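To make finding 1 concrete, here is a minimal sketch of how prompt sensitivity can be measured: enumerate variants of phrasing, enumerator style, and answer order, then score a model under each combination. The templates and helper names (`query_model` in particular) are illustrative assumptions, not DOVE's actual pipeline.

```python
# Sketch: measure how accuracy shifts across prompt variants.
# `query_model` is a hypothetical stand-in for your LLM call.
from itertools import product

PHRASINGS = [
    "Question: {q}\nChoose the best answer.\n{options}\nAnswer:",
    "{q}\nPick one of the following:\n{options}\nYour choice:",
]
ENUMERATORS = [("A.", "B.", "C.", "D."), ("1.", "2.", "3.", "4.")]
ANSWER_ORDERS = [False, True]  # True = reverse the answer order

def format_prompt(phrasing, marks, reverse, question, choices):
    opts = list(reversed(choices)) if reverse else list(choices)
    options = "\n".join(f"{m} {o}" for m, o in zip(marks, opts))
    return phrasing.format(q=question, options=options)

def accuracy_per_variant(dataset, query_model):
    """Score the model on every (phrasing, enumerator, order) combination.

    `dataset` is assumed to be an iterable of (question, choices, gold) tuples.
    """
    scores = {}
    for phrasing, marks, reverse in product(PHRASINGS, ENUMERATORS, ANSWER_ORDERS):
        correct = sum(
            query_model(format_prompt(phrasing, marks, reverse, q, choices)) == gold
            for q, choices, gold in dataset
        )
        scores[(phrasing, marks, reverse)] = correct / len(dataset)
    return scores  # max(...) - min(...) is the sensitivity gap
```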
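Finding 2 points at a simple recipe: score each variant on a small sample of past examples and reuse the winner on new data. A sketch building on `accuracy_per_variant` from the previous block:

```python
# Sketch: pick the prompt variant that scores best on a small sample of
# past examples, then apply it to new instances (names are illustrative).
def select_best_variant(past_examples, query_model):
    scores = accuracy_per_variant(past_examples, query_model)
    return max(scores, key=scores.get)  # highest sample accuracy wins

# Usage: best = select_best_variant(small_labeled_sample, query_model)
# then format all new instances with that phrasing/enumerator/order.
```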
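For finding 3, per-instance consistency falls out of the same grid of results: average each instance's correctness over all prompt variants and look at the extremes. The data layout here is an assumption:

```python
# Sketch: flag instances that stay easy or hard no matter the prompt.
# `results[variant]` is a list of booleans, one per instance (assumed layout).
def split_by_consistency(results, lo=0.05, hi=0.95):
    n_variants = len(results)
    n_items = len(next(iter(results.values())))
    rates = [
        sum(per_variant[i] for per_variant in results.values()) / n_variants
        for i in range(n_items)
    ]
    always_easy = [i for i, r in enumerate(rates) if r >= hi]
    always_hard = [i for i, r in enumerate(rates) if r <= lo]
    return always_easy, always_hard
```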
Goal: democratize LLM evaluation research and build meaningful, generalizable methods.
Talk to us about data you'd like to contribute or request evaluations you want to see added to 🕊️ DOVE!
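If you want to explore the data before contributing, here is a minimal loading sketch; the Hub repo ID and record fields are assumptions, so check the dataset card for the actual layout:

```python
# Sketch: stream a few DOVE records from the Hugging Face Hub.
# The repo ID "nlphuji/DOVE" and the record schema are assumptions.
from datasets import load_dataset

dove = load_dataset("nlphuji/DOVE", split="train", streaming=True)
for record in dove.take(3):
    print(record)  # inspect fields: model, prompt variant, output, ...
```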
AI is changing the world. Is AI regulation on the right track?
While regulators rely on benchmarking, we show why it cannot guarantee AI behavior:
arxiv.org/pdf/2501.15693
Excited about this multidisciplinary collaboration!
@gabistanovsky.bsky.social, @rkeydar.bsky.social, Gadi Perl