
EvalEval Coalition

@eval-eval

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations. https://evalevalai.com/

103
Followers
8
Following
46
Posts
12.06.2025
Joined

Latest posts by EvalEval Coalition @eval-eval

Every Eval Ever: Toward a Common Language for AI Eval Reporting The multistakeholder coalition EvalEval launches Every Eval Ever, a shared format and central eval repository. We're working to resolve AI evaluation fragmentation, improving formatting, settings, and...

Read the full announcement: evalevalai.com/infrastructu...
Shared Task: evalevalai.com/events/share...
Project Webpage: evalevalai.com/projects/eve...

#AIEvaluation #EvalEval

17.02.2026 15:00 👍 0 🔁 0 💬 0 📌 0

Thankful to our partners for their feedback: CAISI, AIEleuther, Huggingface, NomaSecurity, TrustibleAI, InspectAI, Meridian, AVERI, CIP, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, IBM Research 🤝

17.02.2026 15:00 👍 2 🔁 0 💬 1 📌 0

How can you help?

We are launching a shared task alongside our workshop at @aclmeeting.bsky.social

→ Two tracks: public + proprietary eval data
→ Co-authorship for qualifying contributors
→ Workshop at ACL 2026 (San Diego)
→ Deadline: May 1, 2026 📅

17.02.2026 15:00 👍 1 🔁 0 💬 1 📌 0

What we built (rough schema sketch below):

📋 Metadata schema for cross-framework comparison
🔧 Validation via Hugging Face Jobs
🔌 Converters (Inspect AI, HELM, lm-eval-harness)
📊 Community repo organized by benchmark/model/run

✨ Captures scores AND context: settings, prompts, example-level data

17.02.2026 15:00 👍 1 🔁 0 💬 1 📌 0
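
To make the schema idea concrete, here's a rough sketch in Python of what one run record could contain. The field names are our own illustrative assumptions, not the published Every Eval Ever schema:

# Hypothetical eval-run record; field names are illustrative
# assumptions, not the actual Every Eval Ever schema.
record = {
    "benchmark": "mmlu",                      # benchmark identity
    "model": "meta-llama/Llama-2-7b-hf",      # model under test
    "framework": "lm-eval-harness",           # or "inspect-ai", "helm", ...
    "score": 0.488,                           # the headline number...
    "settings": {                             # ...plus the context behind it
        "num_fewshot": 5,
        "prompt_template": "standard",
        "answer_extraction": "logprob-argmax",
    },
    "examples_path": "runs/mmlu/llama/examples.jsonl",  # example-level data
}

Keeping settings and example-level outputs next to the score is what makes runs from different frameworks comparable after the fact.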

This has real costs!

🔬 Signal buried in noise: we can't tell whether differences reflect model capability or just setup
📦 Evaluation debt piles up silently across the ecosystem
🔎 Redundant re-runs of expensive evaluations

🌟 That's where Every Eval Ever comes in

17.02.2026 15:00 👍 1 🔁 0 💬 1 📌 0

🤔 Consider this scenario:

LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU

Same model. Same benchmark name. Different prompts, settings, extraction methods.

💡 Which score is right? Both? Neither? We can't compare. 🤷 (sketch below)

17.02.2026 15:00 👍 1 🔁 0 💬 1 📌 0
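
A hedged sketch of how that mismatch can arise; the settings below are illustrative guesses, not the actual HELM or lm-eval-harness configurations:

# Illustrative guesses at why "MMLU" != "MMLU"; the real frameworks'
# settings may differ from these assumptions.
helm_run = {
    "benchmark": "mmlu",
    "prompt_template": "helm-style",
    "answer_extraction": "generate-then-parse",  # parse a letter from generated text
    "score": 0.637,
}
harness_run = {
    "benchmark": "mmlu",
    "prompt_template": "harness-style",
    "answer_extraction": "logprob-argmax",       # score the likeliest answer choice
    "score": 0.488,
}

def comparable(run_a, run_b):
    # Scores are only comparable when the measurement settings match.
    keys = ("benchmark", "prompt_template", "answer_extraction")
    return all(run_a[k] == run_b[k] for k in keys)

print(comparable(helm_run, harness_run))  # -> False

Only with both records' settings in hand can you tell whether the 0.149 gap reflects the model or the pipeline.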

🚀 Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting 🚀

A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch 🔧

A tale of broken AI evals 🧵👇

evalevalai.com/projects/eve...

17.02.2026 15:00 👍 11 🔁 4 💬 1 📌 4
https://evalevalai.com/events/2026-acl-workshop/

We're seeking submissions on:

🔍 Evaluation validity & reliability
🌍 Sociotechnical impacts
⚙️ Infrastructure & costs
🤝 Community-centered approaches

Full papers (6-8 pages), short papers (4 pages), or tiny papers (2 pages) are welcome.

Check out the full CFP: t.co/JRSr50V7Y6

17.02.2026 00:21 👍 1 🔁 0 💬 0 📌 0

🚨 The next edition of the EvalEval Workshop is coming to
@aclmeeting.bsky.social 2026!

🧠 Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" 🎇

📢 The CFP is now open!!! More details ⬇️

📍 San Diego
📝 Submission deadline: Mar 12, 2026

17.02.2026 00:21 👍 6 🔁 3 💬 1 📌 0

Thank you to everyone who attended, presented at, spoke at, or helped organize this workshop. You rock! Special thanks to the UK AI Security Institute for co-hosting and for their support.

10.12.2025 22:59 👍 0 🔁 0 💬 0 📌 0

It's a wrap on EvalEval in San Diego! A jam-packed day of learning, making new friends, critically examining the field of evals, and walking away with renewed energy and new collaborations!

We have a lot of announcements coming, but first: EvalEval will be back for #ACL2026!

10.12.2025 22:55 👍 5 🔁 1 💬 1 📌 0

📜 Paper: arxiv.org/pdf/2511.056...
📝 Blog: tinyurl.com/blogAI1

🤝 At EvalEval, we are a coalition of researchers working towards better AI evals. Interested in joining us? Check out: evalevalai.com 7/7 🧵

13.11.2025 13:59 👍 0 🔁 0 💬 0 📌 0

Continued...

📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it's easier to define and measure

🛡️ Solution? Standardized reporting & safety policies (6/7)

13.11.2025 13:59 👍 0 🔁 0 💬 1 📌 0

Key Takeaways:

⛔️ First-party reporting is often sparse & superficial, with many reports including NO social impact evals at all
📉 On average, first-party scores are far lower than third-party scores (0.72 vs. 2.62 out of 3)
🎯 Third parties provide some complementary coverage (e.g., for GPT-4 and LLaMA) (5/7)

13.11.2025 13:59 👍 1 🔁 0 💬 1 📌 0

💡 We also interviewed developers from for-profit and non-profit orgs to understand why some disclosures happen and why others don't.

💬 TL;DR: Incentives and constraints shape reporting (4/7)

13.11.2025 13:59 👍 0 🔁 0 💬 1 📌 0

📊 What we did (toy scoring sketch below):

🔎 Analyzed 186 first-party release reports from model developers & 183 post-release (third-party) evaluations
📝 Scored 7 social impact dimensions: bias, harmful content, performance disparities, environmental costs, privacy, financial costs, & labor (3/7)

13.11.2025 13:59 👍 0 🔁 0 💬 1 📌 0
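
As a toy illustration of the scoring arithmetic (invented numbers and a simple mean; the paper's exact rubric and aggregation may differ):

# Each report gets a 0-3 depth score per social impact dimension;
# the data below is made up purely to show the arithmetic.
DIMENSIONS = ["bias", "harmful content", "performance disparities",
              "environmental costs", "privacy", "financial costs", "labor"]

def mean_coverage(reports):
    # Average the 0-3 scores over all reports and dimensions.
    scores = [report[dim] for report in reports for dim in DIMENSIONS]
    return sum(scores) / len(scores)

sparse_report = dict.fromkeys(DIMENSIONS, 0) | {"bias": 2, "privacy": 1}
print(mean_coverage([sparse_report]))  # -> ~0.43: sparse reporting, low mean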

While general capability evaluations are common, social impact assessments (covering bias, fairness, privacy, and more) are often fragmented or missing. 🧠

🎯 Our goal: Explore the AI eval landscape to answer who evaluates what and identify gaps in social impact evals!! (2/7)

13.11.2025 13:59 👍 1 🔁 0 💬 1 📌 0

🚨 AI keeps scaling, but social impact evaluations aren't – and the data proves it 🚨

Our new paper, 📎 "Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations," analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)

13.11.2025 13:59 👍 11 🔁 3 💬 1 📌 0

Note: General registration is limited by venue capacity; attendance will be confirmed by the organizers based on space availability. Accepted posters will be invited to register for free and attend the workshop in person!

06.11.2025 21:19 👍 0 🔁 0 💬 0 📌 0

📮 We are inviting students and early-stage researchers to submit an abstract (max. 500 words) to be presented as a poster during the interactive session. Submit here: tinyurl.com/AbsEval

We have a rock-star lineup of AI researchers and an amazing program. Please RSVP as early as possible! Stay tuned!

06.11.2025 21:19 👍 0 🔁 0 💬 1 📌 1

🚨 EvalEval is back, now in San Diego! 🚨

🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (co-hosted with UKAISI)

📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025

Details below! ⬇️
evalevalai.com/events/works...

06.11.2025 21:19 👍 3 🔁 1 💬 1 📌 1

💡 This paper was brought to you as part of our spotlight series featuring papers on evaluation methods & datasets, the science of evaluation, and more.

📸 Interested in working on better AI evals? Join us: evalevalai.com

31.10.2025 15:47 👍 2 🔁 0 💬 0 📌 0

🚫 The approach also avoids mislabeled data and delays benchmark saturation, continuing to distinguish model improvements even at high performance levels.

📑 Read more: arxiv.org/abs/2509.11106

31.10.2025 15:47 👍 2 🔁 0 💬 1 📌 0

📊 Results & Findings

🧪 Experiments across 6 LLMs and 6 major benchmarks:

🏃 Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️ It achieves lower variance with up to 50× fewer items needed!!

31.10.2025 15:47 👍 1 🔁 0 💬 1 📌 0

It combines two key ideas (minimal sketch below):

✍️ Item Response Theory: models LLM performance in a latent ability space based on item difficulty and discrimination across models
🧨 Dynamic Item Selection: adaptive benchmarking where weaker models get easier items, while stronger models face harder ones

31.10.2025 15:47 👍 1 🔁 0 💬 1 📌 0
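
A minimal sketch of those two ideas, using the standard two-parameter logistic (2PL) IRT model and Fisher-information item selection. This is generic psychometrics for illustration, not the paper's exact implementation:

import numpy as np

def p_correct(theta, a, b):
    # 2PL IRT: probability that a model with latent ability theta answers
    # an item with discrimination a and difficulty b correctly.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def next_item(theta, a, b, asked):
    # Adaptive step: pick the not-yet-administered item with the highest
    # Fisher information at the current ability estimate.
    p = p_correct(theta, a, b)
    info = a**2 * p * (1.0 - p)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Toy usage: 100 items with assumed parameters, current ability estimate 0.3.
rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 2.0, 100), rng.normal(0.0, 1.0, 100)
print(next_item(0.3, a, b, asked={4, 17}))

An easy item (low b) carries almost no information about a strong model (high theta), which is why adaptive selection can match full-benchmark estimates with far fewer items.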

๐Ÿ”How to address this? ๐Ÿค”

๐ŸงฉFluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each modelโ€™s capability level.

Continued...๐Ÿ‘‡

31.10.2025 15:47 ๐Ÿ‘ 1 ๐Ÿ” 0 ๐Ÿ’ฌ 1 ๐Ÿ“Œ 0

โš ๏ธ Evaluation results can be noisy and prone to variance & labeling errors.
๐ŸงฑAs models advance, benchmarks tend to saturate quickly, reducing their longterm usefulness.
๐ŸชƒExisting approaches typically tackle just one of these problems (e.g., efficiency or validity)

What nowโ‰๏ธ

31.10.2025 15:47 ๐Ÿ‘ 1 ๐Ÿ” 0 ๐Ÿ’ฌ 1 ๐Ÿ“Œ 0

💣 Current SOTA benchmarking setups face several systematic issues:

📉 It's often unclear which benchmark(s) to choose, while evaluating on all available ones is expensive, inefficient, and not always aligned with the capabilities we actually want to measure.

More 👇👇

31.10.2025 15:47 👍 1 🔁 0 💬 1 📌 0

✨ Weekly AI Evaluation Paper Spotlight ✨

🤔 Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?

🖇️ "Fluid Language Model Benchmarking" by @valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models

31.10.2025 15:47 👍 3 🔁 0 💬 1 📌 1

💡 This is part of our new weekly spotlight series that will feature papers on evaluation methods & datasets, the science of evaluation, and more.

📷 Interested in working on better AI evals? Check out: evalevalai.com

24.10.2025 16:44 👍 1 🔁 0 💬 0 📌 0