
Eliya Habba

@eliyahabba

PhD student at Hebrew University #HebrewU #NLP

56 Followers
165 Following
7 Posts
Joined 21.11.2024

Latest posts by Eliya Habba @eliyahabba

Let’s build a more robust foundation for LLM evaluation!

A collaboration from @hebrewuniversity.bsky.social @nlphuji.bsky.social @IBMResearch and more:

@yperlitz.bsky.social @lchoshen.bsky.social @gabistanovsky.bsky.social

17.03.2025 14:43 👍 5 🔁 0 💬 0 📌 0

3. Some instances are consistently easy or hard across ALL prompts, no matter how you prompt: models either always succeed or consistently fail.

17.03.2025 14:39 👍 1 🔁 0 💬 1 📌 0

2. Selecting prompt characteristics (e.g., phrasing, enumerators) based on past examples helps efficiently find optimal prompts.

17.03.2025 14:38 👍 1 🔁 0 💬 1 📌 0

Key findings from 🕊️ DOVE:

1. Prompt sensitivity is HUGE! Performance varies dramatically with small changes (e.g., OLMo’s accuracy on HellaSwag ranges from 1% to 99% simply by changing prompt elements like phrasing, enumerators, and answer order).

17.03.2025 14:38 👍 1 🔁 0 💬 1 📌 0

Goal: democratize LLM evaluation research and build meaningful, generalizable methods.

Talk to us about data you'd like to contribute, or request evaluations you want to see added to 🕊️ DOVE!

17.03.2025 14:38 👍 2 🔁 0 💬 1 📌 0

Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!

17.03.2025 14:37 👍 11 🔁 3 💬 1 📌 2

🌍 AI is changing the world. Is AI regulation on the right track? 🤔

While regulators rely on benchmarking 📊, we show why it cannot guarantee AI behavior:
arxiv.org/pdf/2501.15693

Excited about this multidisciplinary collaboration!
@gabistanovsky.bsky.social,
@rkeydar.bsky.social, Gadi Perl

03.02.2025 09:00 👍 0 🔁 0 💬 0 📌 0