Blog post on exciting research happening at MSR, including our recent work ChatBench on human-AI vs AI-alone evaluation!
Looking forward to taking part in this CHI'25 panel organized by @angelhwang.bsky.social !!
Check out ChatBench online and see our paper for analyses of the user-AI conversations! Thanks to my fantastic collaborators at
@msftresearch.bsky.social, @ashtonanderson.bsky.social and @jakehofman.bsky.social! 9/
ChatBench: huggingface.co/datasets/mic...
Paper: arxiv.org/abs/2504.07114
Fine-tuning greatly improves the simulator's ability to estimate real user-AI accuracies, increasing correlation on unseen questions by >20 pts. Our results demonstrate the promise of simulation to scale interactive eval, but also the need to test simulators on real human behavior. 8/
These results motivate the need to incorporate human interaction into AI evaluation. However, how do we do this at scale? We propose an LLM-based user simulator and transform user-AI conversations + answers from ChatBench into supervised fine-tuning data for the simulator. 7/
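One way to picture the fine-tuning data construction: each real user turn in a ChatBench conversation becomes a training target for the simulator, conditioned on the dialogue so far. A minimal sketch, where the field names, chat schema, and `FINAL:` convention are illustrative assumptions rather than the paper's actual format:

```python
# Hypothetical sketch: turning a ChatBench-style conversation plus the
# user's final answer into supervised fine-tuning examples for a user
# simulator. Schema and field names are assumptions for illustration.

def to_sft_examples(conversation, final_answer):
    """Each user turn becomes a prediction target given the prior context."""
    examples = []
    context = []
    for turn in conversation:
        if turn["role"] == "user":
            examples.append({"prompt": list(context), "target": turn["text"]})
        context.append(turn)
    # The simulator must also emit the user's final benchmark answer.
    examples.append({"prompt": list(context), "target": f"FINAL: {final_answer}"})
    return examples

conv = [
    {"role": "user", "text": "Can you help with a physics question?"},
    {"role": "assistant", "text": "Sure, what is it?"},
    {"role": "user", "text": "What force keeps planets in orbit?"},
    {"role": "assistant", "text": "Gravity."},
]
examples = to_sft_examples(conv, "B")
print(len(examples))  # two user turns + one final-answer example
```

Rolling the conversation out against an LLM, then scoring the simulator's `FINAL:` answers, would give the simulated user-AI accuracy on unseen questions.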
Model-level conclusions also change: what seemed like a large gap between GPT-4o and Llama-3.1-8b on AI-alone (25 pts) shrinks to less than 10 pts after incorporating user interactions. This could impact real-world decisions, where a smaller, lighter model might be preferred. 6/
Across subjects, models, AI-alone methods, and user-AI conditions, AI-alone fails to predict user-AI accuracy. Letter-only is especially bad with a mean gap of 21 pts. Free-text is better, with a mean gap of 10 pts, but still differs significantly from user-AI in many cases. 5/
For AI-alone, we test (1) common letter-only methods, which require the model to answer with a single letter, and (2) our free-text method, where the model responds without any constraints and GPT-4o then extracts an answer, simulating a user copy-pasting the question and scanning the response for an answer. 4/
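The contrast between the two AI-alone methods comes down to how the question is posed. A minimal sketch of the two prompt styles, where the exact wording is an assumption, not the paper's template:

```python
# Hypothetical prompt formats for the two AI-alone methods described above.
# The wording is illustrative; the paper's actual templates may differ.

def letter_only_prompt(question: str, choices: dict) -> str:
    """Constrain the model to reply with a single letter (A-D)."""
    opts = "\n".join(f"{k}. {v}" for k, v in choices.items())
    return (f"{question}\n{opts}\n"
            "Answer with a single letter (A, B, C, or D) and nothing else.")

def free_text_prompt(question: str, choices: dict) -> str:
    """No constraints: mimics a user pasting the question into a chat.
    A second model (e.g., GPT-4o) would later extract the chosen letter
    from the free-form response."""
    opts = "\n".join(f"{k}. {v}" for k, v in choices.items())
    return f"{question}\n{opts}"

q = "What force keeps planets in orbit?"
c = {"A": "Friction", "B": "Gravity", "C": "Magnetism", "D": "Inertia"}
print(letter_only_prompt(q, c))
print(free_text_prompt(q, c))
```

The free-text variant better matches how people actually use chat interfaces, which is one reason its accuracies track user-AI accuracy more closely.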
We conduct a large-scale user study on Prolific, collecting data across 5 MMLU datasets (physics, moral reasoning, three levels of math), 2 models (GPT-4o & Llama-3.1-8b), and 2 user-AI conditions (user answers first vs. directly with AI), yielding 7.3K user-AI conversations. 3/
Standard benchmarks test AI on its own ("AI-alone") on static questions, missing human variability, interactivity, and writing style. We bring benchmarks to life by seeding human users with the benchmark question and having them interact with the LLM to answer it. 2/
1st post on bsky!
What happens when a static benchmark comes to life? ✨ Introducing ChatBench, a large-scale user study where we *converted* MMLU questions into thousands of user-AI conversations. Then, we trained a user simulator on ChatBench to generate user-AI outcomes on unseen questions. 1/ 🧵
Thanks Johan!! :)
Check out ChatBench, our new paper+dataset. We turned AI benchmarks into user-AI chats and show that AI-alone evals often fail to predict how real humans perform with AI.
@serinachang5.bsky.social @ashtonanderson.bsky.social
serinachang5.github.io/assets/files...
huggingface.co/datasets/mic...