
Lexin Zhou

@lexinzhou

PhD candidate at Princeton | Research on AI Evaluation, Social Computing, AI Safety, RL | https://lexzhou.github.io

20
Followers
17
Following
25
Posts
28.09.2024
Joined

Latest posts by Lexin Zhou @lexinzhou

Thrilled to share this accessible MSR blogpost that summarizes our latest work on building a Science of AI Evaluation, where we manage to both reliably explain and predict success/failure of general-purpose AI models on new, unforeseen tasks and environments!

13.05.2025 04:08 πŸ‘ 5 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

The platform will be actively and continuously updated (scales, battery, results, blog posts, tutorials, etc.), underpinning the reliable deployment of AI with both explanatory and predictive power in the years to come.

Contributions and feedback are welcome!

14.03.2025 03:37 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

🚨 To continuously foster conceptual & technical innovations for a science of AI Evaluation:

An open, collaborative community has been initiated by the Leverhulme Centre for the Future of Intelligence to adopt and extend our novel methodology.

Join us: kinds-of-intelligence-cfi.github.io/ADELE!

14.03.2025 03:37 πŸ‘ 4 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0
Post image

To better understand why this matters in high-stakes contexts, you can also check out our previous work. We discuss why predicting model performance (e.g., failures on out-of-distribution languages in machine translation) remains essential in legal contexts.

11.03.2025 20:07 πŸ‘ 4 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

Understanding and extrapolating benchmark results will become essential for effective policymaking and informing users. New work identifies indicators that have high predictive power in modeling LLM performance. Excited for it to be out!

11.03.2025 20:07 πŸ‘ 11 πŸ” 3 πŸ’¬ 1 πŸ“Œ 0

19/ To wrap up, huge thanks to Serina Chang, Miri Zilka, Jianxun Lian and Chengzu Li for their valuable help and feedback at various stages of the project!

11.03.2025 18:36 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

18.2/ Kexin Jiang Chen, Pablo A. M. Casares, Jiyun Zu, John Burden, Behzad Mehrbakhsh, David Stillwell, Manuel Cebrian, Jindong Wang, @peterhenderson.bsky.social, @sherrytswu.bsky.social, Patrick C. Kyllonen, @lucycheke.bsky.social, Xing Xie, JosΓ© HernΓ‘ndez-Orallo

11.03.2025 18:35 πŸ‘ 2 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

18.1/ Tagging my wonderful collaborators who have been highly constructive in this work:

Lorenzo Pacchiardi, Fernando MartΓ­nez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo SΓ‘nchez-GarcΓ­a, ...

11.03.2025 18:35 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Preview
The AI Evaluation Substack | Substack A monthly digest of the latest developments, research trends and key initiatives in the realm of AI evaluation. Click to read The AI Evaluation Substack, a Substack publication with hundreds of subscr...

17/ Paper: arxiv.org/pdf/2503.06378

Newsletters: If you are drawn to everything relevant to AI Evaluation and want to stay informed, please subscribe to our monthly AI Evaluation Digest newsletter! (aievaluation.substack.com)

11.03.2025 18:31 πŸ‘ 1 πŸ” 1 πŸ’¬ 2 πŸ“Œ 0

16/ Future work to improve our methodology and thus AI evaluation:

- Analyse multimodal systems and embodied AI
- Expand the demand level 5+ into finer-grained levels 5-10
- Enhance the coverage of instances at demand level 5+
- We encourage collaborative efforts on extending our methodology. Contact: jh2135@cam.ac.uk

11.03.2025 18:29 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

15/ The scales, rubrics, battery, and results presented here mark a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. They will be operationalised through a platform in the coming years, ready to explain and predict the performance and safety of AI systems.

11.03.2025 18:29 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

14/ Takeaways on our novel methodology:

- General scales (stable across SOTA/frontier AI, no saturation!)
- AI benchmarks and systems become commensurate!
- Explanatory power (demand profiles, ability profiles)
- Predictive power at the instance level (especially OOD!)
- Fully automated procedure

11.03.2025 18:28 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

13/ Even better, we build a Random Forest (RF) classifier fed with the 18 demand levels to predict the performance of LLMs at the instance level. This yields high predictive power (high AUROC and nearly perfect calibration!) both in-distribution and out-of-distribution, outperforming black-box predictors.

11.03.2025 18:26 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
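A minimal sketch of what such an instance-level predictor could look like (illustrative only, with synthetic data; not the paper's code):

```python
# Predict per-instance success of an LLM from its 18 annotated demand levels
# with a Random Forest. All data below is synthetic; real labels would come
# from actual model runs on the battery.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 7, size=(1000, 18))    # demand levels 0..6 per dimension
y_train = (X_train.mean(axis=1) < 3).astype(int) # toy rule: easy -> solved
X_test = rng.integers(0, 7, size=(200, 18))

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
p_success = clf.predict_proba(X_test)[:, 1]      # per-instance success probabilities
# With held-out labels y_test, sklearn.metrics.roc_auc_score(y_test, p_success)
# would give the AUROC figure the thread reports on.
```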
Post image

12/ On predictive power: we can match these interpretable ability profiles against the demand profiles of benchmarks or individual instances to anticipate the performance of LLMs on them: the larger the margin of (model) abilities over (task) demands, the more likely the model is to succeed.

11.03.2025 18:25 πŸ‘ 2 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
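A toy sketch of that ability-vs-demand logic (the logistic link and the min-over-dimensions margin are assumptions of this illustration, not the paper's exact model):

```python
import numpy as np

def p_success(abilities, demands, slope=1.5):
    """Success probability grows with the margin of abilities over demands.
    Taking the minimum margin (weakest dimension dominates) is an assumption
    of this sketch."""
    margin = np.min(np.asarray(abilities) - np.asarray(demands))
    return 1.0 / (1.0 + np.exp(-slope * margin))

ability_profile = [4.2, 3.1, 5.0]  # hypothetical 3-dimension ability profile
demand_profile = [3.0, 2.5, 4.0]   # demand profile of one task instance
print(p_success(ability_profile, demand_profile))  # margin 0.6 -> p ~ 0.71
```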
Post image

11/ Takeaways from ability profiles:

- Newer LLMs have higher abilities than older ones, but the trend is NOT monotonic across all abilities
- Knowledge scales are limited by model size and distillation processes
- Reasoning, learning and abstraction, and social capabilities are boosted in β€˜reasoning’ models

11.03.2025 18:22 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

10/ Beyond the full SCCs, we can summarise each curve by computing an ability score for each dimension, defined as the x-value where the success probability is 0.5 (the point of maximum slope/information in an SCC), following psychometric tradition. This yields many insights under our new evaluation:πŸ‘‡

11.03.2025 18:21 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
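A small sketch of extracting that ability score from a fitted logistic curve (synthetic data; for p(x) = sigmoid(wx + b), p = 0.5 exactly at x = -b/w):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic per-instance data for one dimension: demand level vs. solved/failed.
levels = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6]).reshape(-1, 1)
solved = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

fit = LogisticRegression().fit(levels, solved)
w, b = fit.coef_[0, 0], fit.intercept_[0]
ability = -b / w  # demand level where the fitted curve crosses P(success) = 0.5
print(f"ability score: {ability:.2f}")
```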
Post image

9/ The SCCs of certain dimensions are steep, which explains (and predicts) success very well for instances in the low and high ranges. In contrast, the SCCs of other dimensions are flatter and show strong differences between LLMs, i.e., lower power to discriminate successes from failures.

11.03.2025 18:21 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

8/ To evaluate abilities, we show the subject characteristic curve (SCC) for each dimension: the probability of success as a logistic function of demand levels. We use dominant slicing: for level k of the target dimension, all other dimensions are <= k.

Here's an example SCC, but next post has all SCCs.

11.03.2025 18:20 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
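A sketch of dominant slicing on a hypothetical demand-level matrix (synthetic data, illustrative only):

```python
import numpy as np

def dominant_slice(demands, dim, k):
    """Select instances whose target dimension sits at level k while every
    other dimension is <= k, as in the dominant slicing described above.
    demands: (n_instances, n_dims) array of annotated demand levels."""
    others = np.delete(demands, dim, axis=1)
    return (demands[:, dim] == k) & (others <= k).all(axis=1)

demands = np.random.default_rng(0).integers(0, 7, size=(1000, 18))
mask = dominant_slice(demands, dim=0, k=3)
# A model's success rate on the mask-selected instances gives the SCC point
# at level 3; repeating over k = 0..5+ and fitting a logistic traces the curve.
```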

7/ With our demand levels, we can additionally infer the actual ability profiles of LLMs. These are robust to changes in the difficulty distribution of test instances, unlike brittle benchmark average scores (e.g., one model achieves 80% accuracy on MATH but only 20% on AIME).

11.03.2025 18:19 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

6/ Surprisingly, inspecting demand levels reveals that all 20 benchmarks from recent top AI/NLP conferences lack construct validity: they either do not measure what they claim to measure (lacking specificity) or include only intermediate difficulties for the target ability scale (lacking sensitivity).

11.03.2025 18:19 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

5/ We annotate demand levels across 18 dimensions for 16K instances sampled from 63 tasks on 20 benchmarks. This forms the Annotated-Demand-Levels (ADeLe) battery, which elegantly places task instances from many different benchmarks in the same commensurate space!

11.03.2025 18:18 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
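A sketch of what a battery record might look like (benchmark names and all dimension names except KNn are placeholders, not the actual ADeLe schema):

```python
import pandas as pd

# Hypothetical shape of ADeLe battery records: provenance of each instance
# plus its 18 demand-level annotations (columns abbreviated here).
battery = pd.DataFrame([
    {"benchmark": "bench_A", "task": "task_3", "instance_id": 17,
     "KNn": 2, "dim_2": 1, "dim_3": 4},   # ...15 more demand columns
    {"benchmark": "bench_B", "task": "task_9", "instance_id": 512,
     "KNn": 5, "dim_2": 0, "dim_3": 2},
])
# Annotating every instance on the same 18 scales is what makes instances
# from different benchmarks commensurate.
```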
Post image

4/ For example, in the natural sciences knowledge (KNn) rubric, we use education levels to represent demand levels from 0 to 5+.

A demand level of 0 means KNn is not required to solve the task, while 5+ means graduate level or beyond.

Similar/related principles are applied to other rubrics.

11.03.2025 18:17 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
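A sketch of that level scheme as a lookup table; only the endpoints (0 and 5+) come from the post, and the intermediate education levels are assumed here:

```python
# Illustrative KNn demand-level mapping (intermediate levels are assumptions).
KNN_DEMAND_LEVELS = {
    0: "no natural sciences knowledge required",
    1: "primary school",
    2: "middle school",
    3: "high school",
    4: "undergraduate",
    5: "graduate level or beyond (5+)",
}
```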
Post image

3/ To address these issues, we craft 18 novel rubrics to annotate demand levels (0 to 5+) for 18 general scales from a taxonomy of cognitive abilities, focusing on LLMs:

Primordial: 11 cognitive capabilities
Knowledge: 5 branches of knowledge
Extraneous: 2 other elements that make tasks difficult

11.03.2025 18:16 πŸ‘ 1 πŸ” 1 πŸ’¬ 2 πŸ“Œ 0

2/ Motivation: the current AI evaluation paradigm struggles in several ways:

- Can’t robustly explain and predict where an AI can be deployed reliably and safely
- Can’t precisely explain what benchmarks really measure
- Incomparable aggregate scores between benchmarks
- Benchmark saturation
- Changing scales
- …

11.03.2025 18:14 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Post image

Thrilled to unlock AI Evaluation with explanatory and predictive power through general ability scales!

With a new methodology to
-Explain what common benchmarks really measure
-Extract explainable ability profiles of AI systems
-Predict performance for new task instances, in & out-of-distribution
🧡

11.03.2025 18:12 πŸ‘ 6 πŸ” 2 πŸ’¬ 1 πŸ“Œ 2
x.com

Thanks for sharing. In case this is of interest, here's a summary thread on X! x.com/lexin_zhou/s...

03.12.2024 14:55 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

add

28.09.2024 21:26 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0