Blog post on exciting research happening at MSR, including our recent work ChatBench on human-AI vs AI-alone evaluation!
Looking forward to taking part in this CHI'25 panel organized by @angelhwang.bsky.social !!
Check out ChatBench online and see our paper for analyses of the user-AI conversations! Thanks to my fantastic collaborators at
@msftresearch.bsky.social, @ashtonanderson.bsky.social and @jakehofman.bsky.social! 9/
ChatBench: huggingface.co/datasets/mic...
Paper: arxiv.org/abs/2504.07114
Fine-tuning greatly improves the simulator's ability to estimate real user-AI accuracies, increasing correlation on unseen questions by >20 pts. Our results demonstrate the promise of simulation to scale interactive eval, but also the need to test simulators on real human behavior. 8/
These results motivate the need to incorporate human interaction into AI evaluation. However, how do we do this at scale? We propose an LLM-based user simulator and transform user-AI conversations + answers from ChatBench into supervised fine-tuning data for the simulator. 7/
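One way to picture the fine-tuning data construction: each real user turn in a ChatBench conversation becomes a training target for the simulator, conditioned on the dialogue so far. A minimal sketch, where the field names, chat schema, and `FINAL:` convention are illustrative assumptions rather than the paper's actual format:

```python
# Hypothetical sketch: turning a ChatBench-style conversation plus the
# user's final answer into supervised fine-tuning examples for a user
# simulator. Schema and field names are assumptions for illustration.

def to_sft_examples(conversation, final_answer):
    """Each user turn becomes a prediction target given the prior context."""
    examples = []
    context = []
    for turn in conversation:
        if turn["role"] == "user":
            examples.append({"prompt": list(context), "target": turn["text"]})
        context.append(turn)
    # The simulator must also emit the user's final benchmark answer.
    examples.append({"prompt": list(context), "target": f"FINAL: {final_answer}"})
    return examples

conv = [
    {"role": "user", "text": "Can you help with a physics question?"},
    {"role": "assistant", "text": "Sure, what is it?"},
    {"role": "user", "text": "What force keeps planets in orbit?"},
    {"role": "assistant", "text": "Gravity."},
]
examples = to_sft_examples(conv, "B")
print(len(examples))  # two user turns + one final-answer example
```

Rolling the conversation out against an LLM, then scoring the simulator's `FINAL:` answers, would give the simulated user-AI accuracy on unseen questions.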
Model-level conclusions also change: what seemed like a large gap between GPT-4o and Llama-3.1-8b on AI-alone (25 pts) shrinks to less than 10 pts after incorporating user interactions. This could impact real-world decisions, where a smaller, lighter model might be preferred. 6/
Across subjects, models, AI-alone methods, and user-AI conditions, AI-alone fails to predict user-AI accuracy. Letter-only is especially bad with a mean gap of 21 pts. Free-text is better, with a mean gap of 10 pts, but still differs significantly from user-AI in many cases. 5/
For AI-alone, we test (1) common letter-only methods, which require the model to answer with a single letter, and (2) our free-text method, where the model responds without any constraints and GPT-4o then extracts an answer, simulating a user copy-pasting the question and scanning the response for an answer. 4/
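The contrast between the two AI-alone methods comes down to how the question is posed. A minimal sketch of the two prompt styles, where the exact wording is an assumption, not the paper's template:

```python
# Hypothetical prompt formats for the two AI-alone methods described above.
# The wording is illustrative; the paper's actual templates may differ.

def letter_only_prompt(question: str, choices: dict) -> str:
    """Constrain the model to reply with a single letter (A-D)."""
    opts = "\n".join(f"{k}. {v}" for k, v in choices.items())
    return (f"{question}\n{opts}\n"
            "Answer with a single letter (A, B, C, or D) and nothing else.")

def free_text_prompt(question: str, choices: dict) -> str:
    """No constraints: mimics a user pasting the question into a chat.
    A second model (e.g., GPT-4o) would later extract the chosen letter
    from the free-form response."""
    opts = "\n".join(f"{k}. {v}" for k, v in choices.items())
    return f"{question}\n{opts}"

q = "What force keeps planets in orbit?"
c = {"A": "Friction", "B": "Gravity", "C": "Magnetism", "D": "Inertia"}
print(letter_only_prompt(q, c))
print(free_text_prompt(q, c))
```

The free-text variant better matches how people actually use chat interfaces, which is one reason its accuracies track user-AI accuracy more closely.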
We conduct a large-scale user study on Prolific, collecting data across 5 MMLU datasets (physics, moral reasoning, three levels of math), 2 models (GPT-4o & Llama-3.1-8b), and 2 user-AI conditions (user answers first vs. directly with AI), yielding 7.3K user-AI conversations. 3/
Standard benchmarks test AI on its own ("AI-alone") on static questions, missing human variability, interactivity, and writing style. We bring benchmarks to life by seeding human users with the benchmark question and having them interact with the LLM to answer it. 2/
1st post on bsky!
What happens when a static benchmark comes to life? ✨ Introducing ChatBench, a large-scale user study where we *converted* MMLU questions into thousands of user-AI conversations. Then, we trained a user simulator on ChatBench to generate user-AI outcomes on unseen questions. 1/ 🧵
Thanks Johan!! :)
Check out ChatBench, our new paper+dataset. We turned AI benchmarks into user-AI chats and show that AI-alone evals often fail to predict how real humans perform with AI.
@serinachang5.bsky.social @ashtonanderson.bsky.social
serinachang5.github.io/assets/files...
huggingface.co/datasets/mic...