
Xiaoyan Bai

@elenal3ai

PhD @UChicagoCS / BE in CS @Umich / ✨ AI/NLP transparency and interpretability / 📷🎨 photography, painting

443
Followers
184
Following
71
Posts
15.11.2024
Joined

Latest posts by Xiaoyan Bai @elenal3ai


Peer review is facing a death spiral, and AI production tools are speeding it up. AI-assisted reviewing is necessary and should be open. We built OpenAIReview: open AI reviewing for everyone, for the cost of a coffee.

openaireview.github.io/blog.html 🧵

09.03.2026 18:48 👍 20 🔁 7 💬 1 📌 4

Democracy depends on an informed electorate. But political issues and ballot measures can be confusing, obscuring the effects of one outcome versus another. And politics is personal. Once we make an initial decision, it can be hard to see things from "the other side."

19.02.2026 21:46 👍 4 🔁 2 💬 2 📌 2

Huge thanks to my collaborators Alexander Baumgartner and Haojia Sun, @ari-holtzman.bsky.social, and
@chenhaotan.bsky.social. We also thank Yonatan Belinkov, Tal Haklay, Austin Kozlowski, Jiachen Liu, and Tamar Rott Shaham for discussions. We are grateful to Modal for computing credits.
12/n

10.02.2026 20:12 👍 1 🔁 0 💬 0 📌 0

Our repo: github.com/ChicagoHAI/M...
📜 Our paper: elena-baixy.github.io/TheStoryisNo...
(Unfortunately, we are on hold at @arxiv.bsky.social; we will update with the arXiv link later!)
11/n

10.02.2026 19:44 👍 2 🔁 0 💬 1 📌 0

🔮 Toward the future
AI reviewers require caution since they can hallucinate and struggle with instruction following. But we argue for this change: 📄 Story-based review → ⚙️ Execution-grounded evaluation
The story is not the science. MechEvalAgent is a step toward that future.
10/n

10.02.2026 19:44 👍 1 🔁 0 💬 1 📌 0

⚠️ Implications
Narrative-alone review is insufficient.
Passing MechEvalAgent doesn't guarantee "good" research, but failing it reveals actionable weaknesses. Our results demonstrate that AI agents can raise the floor of research evaluation by exposing weaknesses that humans often overlook.
9/n

10.02.2026 19:44 👍 0 🔁 0 💬 1 📌 0

โฑ๏ธ Efficiency matters

Humans took ~2.2 hours per evaluation, while MechEvalAgent finishes in <30 min for agent repos and ~1 hr for human repos
8/n

10.02.2026 19:44 👍 0 🔁 0 💬 1 📌 0

🧩 Examples of hidden failures

MechEvalAgent uncovered issues like:
❌ Missing or broken files preventing reproduction.
❌ Strong conclusions reached despite performance only slightly above baseline.
❌ Reported metrics deviating by >8% on rerun (a toy version of this check is sketched below this post).
❌ Experiments claimed but never implemented.
7/n

10.02.2026 19:44 👍 0 🔁 0 💬 1 📌 0
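
A minimal sketch of that rerun-deviation check, assuming a flat reported-vs-rerun metric comparison; the function names and example numbers are illustrative, and only the 8% threshold comes from this thread:

```python
# Toy version of the rerun-deviation check mentioned above.
# Assumption: metrics are a flat {name: value} dict; MechEvalAgent's
# real interface is not shown in the thread.

def metric_deviation(reported: float, rerun: float) -> float:
    """Relative deviation of a rerun value from its reported value."""
    if reported == 0:
        return float("inf") if rerun != 0 else 0.0
    return abs(rerun - reported) / abs(reported)

def flag_drifting_metrics(reported: dict[str, float],
                          rerun: dict[str, float],
                          threshold: float = 0.08) -> list[str]:
    """Names of metrics whose rerun deviates beyond the threshold (8% here)."""
    return [name for name, value in reported.items()
            if name in rerun and metric_deviation(value, rerun[name]) > threshold]

# Example: accuracy reported as 0.91 reruns at 0.82 (~9.9% deviation).
print(flag_drifting_metrics({"accuracy": 0.91}, {"accuracy": 0.82}))  # ['accuracy']
```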

🔥 MechEvalAgent vs Humans

MechEvalAgent achieves:
📌 80%+ agreement with expert reviewers
❗️ Surfaces 51 additional issues humans missed.
Execution reveals problems a narrative-alone review simply cannot.
6/n

10.02.2026 19:44 👍 0 🔁 0 💬 1 📌 0

Even in human repos, execution breakdowns are common. Our ablation study shows that without execution, these failures are hard to catch.
5.5/n

10.02.2026 19:44 👍 0 🔁 0 💬 1 📌 0

🚨 Key finding: failures are everywhere

We evaluated 30 research outputs across replication tasks, open-ended research questions, and human-written repositories.
Across projects, 93% fail reproducibility and 80% fail coherence.
5/n

10.02.2026 19:44 👍 0 🔁 0 💬 1 📌 0

🤖 Introducing MechEvalAgent

MechEvalAgent checks research on three dimensions (a toy schema is sketched below this post):
✅ Coherence
✅ Reproducibility
✅ Generalizability
We focus on mech-interp as a testbed because its claims are often testable through rerunning interventions. We use Claude Code and Scribe as our backbone.
4/n

10.02.2026 19:44 👍 1 🔁 0 💬 1 📌 0
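
A toy schema for that three-dimension report, assuming each dimension yields a pass/fail verdict plus a list of issues; every name here is hypothetical, not MechEvalAgent's actual output format:

```python
# Hypothetical report schema; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DimensionResult:
    passed: bool
    issues: list[str] = field(default_factory=list)  # actionable weaknesses found

@dataclass
class EvalReport:
    coherence: DimensionResult         # do implementation, results, and claims line up?
    reproducibility: DimensionResult   # does a fresh rerun match the reported numbers?
    generalizability: DimensionResult  # do the findings transfer to new questions?

report = EvalReport(
    coherence=DimensionResult(passed=True),
    reproducibility=DimensionResult(passed=False, issues=["metrics deviate >8% on rerun"]),
    generalizability=DimensionResult(passed=True),
)
print(not report.reproducibility.passed)  # True: failing reveals actionable weaknesses
```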

๐Ÿ” Execution-Grounded Evaluation Framework

We propose the first framework that evaluates research outputs as:
๐Ÿ“„ Narrative (plan + report + human input prompts)
๐Ÿ’ป Execution resources (code + data + walkthrough)
Together, they enable verification beyond narrative.
3/n

10.02.2026 19:44 👍 1 🔁 0 💬 1 📌 0
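
A minimal sketch of the two-part research-output bundle, assuming the narrative and execution pieces named in the post; all class and field names are illustrative:

```python
# Hypothetical bundle mirroring the narrative + execution split above.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Narrative:
    plan: str            # what the researchers or agent intended to do
    report: str          # what they claim to have found
    prompts: list[str]   # human input prompts that steered the work

@dataclass
class ExecutionResources:
    code: Path           # repo with the runnable experiments
    data: Path           # datasets needed to rerun them
    walkthrough: str     # steps a fresh session can follow

@dataclass
class ResearchOutput:
    narrative: Narrative           # reviewable on its own, like a paper
    execution: ExecutionResources  # what enables verification beyond the narrative
```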

โš–๏ธ Why it matters

Science is facing reproducibility crises, and AI is accelerating research output faster than humans can evaluate.
Paper-only review canโ€™t catch:
โž• Broken code
โž• Missing experiments
โž• Silent metric bugs
โž• Overclaimed conclusions
We need execution!
2/n

10.02.2026 19:44 👍 1 🔁 0 💬 1 📌 0

📖 ≠ 🧪 The Story is Not the Science.
Code is submitted but rarely executed during peer review, an issue likely to worsen with research agents. 🧑‍🔬
We introduce MechEvalAgent, an execution-grounded evaluation of narrative + execution. Verify the science, not just the story.
1/n

10.02.2026 19:44 👍 8 🔁 4 💬 2 📌 0
Why can't powerful AIs learn basic multiplication? New research reveals why even state-of-the-art large language models stumble on seemingly easy tasks, and what it takes to fix it

Featured in UChicago News: @elenal3ai.bsky.social & @chenhaotan.bsky.social's research into why AI can write complex code but fails at 4-digit multiplication:
tinyurl.com/5ukvm7p7

13.01.2026 21:30 👍 2 🔁 1 💬 0 📌 0
Concept Incongruence: An Exploration of Time and Death in Role Playing Consider this prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to gener...

📑 Paper: arxiv.org/abs/2505.14905
💻 Website: elena-baixy.github.io/concept-inco...

24.11.2025 19:18 👍 0 🔁 0 💬 0 📌 0

Will be at #NeurIPS2025 presenting "Concept Incongruence"!

🦄🦆 Curious about a unicorn duck? Stop by, get one, and chat with us!

We made a new demo for detecting hidden conflicts in system prompts to spot "concept incongruence" for safer prompts.

🔗: github.com/ChicagoHAI/d...

🗓️ Dec 3, 11 AM - 2 PM

24.11.2025 19:18 👍 6 🔁 1 💬 1 📌 1

✨ We thank @boknilev.bsky.social for his insightful suggestions!

20.11.2025 21:46 👍 0 🔁 0 💬 0 📌 0
MechEvalAgent: Grounded Evaluation of Research Agents in Mechanistic Interpretability (Notion)

We want to make "AI doing science" something we can inspect and trust.
If you're excited about grounded evaluation and want to push this forward, check out our blog and repo; contributions are welcome. 👇

🖊️ Blog: tinyurl.com/MechEvalAgents
🧑‍💻 Repo: github.com/ChicagoHAI/M...

7/n🧵

20.11.2025 21:46 👍 0 🔁 0 💬 1 📌 0

🚧 What remains hard and what comes next:
- Better question design: It remains hard to automate questions that test generalization.
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic.

6/n🧵

20.11.2025 21:46 👍 0 🔁 0 💬 1 📌 0

A failure example: The agent "validated" its circuit by checking whether the neurons it used happened to be on a list of names we provided.

5/n🧵

20.11.2025 21:46 👍 0 🔁 0 💬 1 📌 0

โ—๏ธWhat we found in our case studies:
We tested across three tasks: IOI replication, open-ended sarcasm circuit locating, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization

4/n๐Ÿงต

20.11.2025 21:46 👍 0 🔁 0 💬 1 📌 0

2๏ธโƒฃ A grounded evaluation pipeline:
- Coherence: Do the implementation, results, and claims line up?
- Reproducibility: Can a fresh session rerun the experiment and get the same results?
- Generalizability: Can the agent design questions that demonstrate real insight transfer?

3/n๐Ÿงต

20.11.2025 21:46 👍 0 🔁 0 💬 1 📌 0
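
A toy driver for that three-check pipeline, assuming each check takes a research output and returns a list of issues; the lambda stubs stand in for the agent-driven evaluation and are not the actual implementation:

```python
# Hypothetical pipeline driver; checks here are stubs, not MechEvalAgent's code.
from typing import Callable

Check = Callable[[dict], list[str]]  # consumes a research output, returns issues

def run_pipeline(output: dict, checks: dict[str, Check]) -> dict[str, list[str]]:
    """Run each check independently (as a fresh session would) and collect issues."""
    return {name: check(output) for name, check in checks.items()}

# Stub checks mirroring the three dimensions in the post:
checks = {
    "coherence": lambda o: [] if o["claims_match_results"] else ["claims overreach results"],
    "reproducibility": lambda o: [] if o["rerun_matches"] else ["rerun diverges from report"],
    "generalizability": lambda o: [] if o["transfers"] else ["insight does not transfer"],
}

output = {"claims_match_results": True, "rerun_matches": False, "transfers": True}
print(run_pipeline(output, checks))
# {'coherence': [], 'reproducibility': ['rerun diverges from report'], 'generalizability': []}
```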

We have two components.

1️⃣ A unified research-output format:
To evaluate execution, we first unified agent outputs into a standard format:
Plan → Code → Walkthrough → Report.
This makes agents comparable and their reasoning trace inspectable (a minimal layout check is sketched below this post).

2/n🧵

20.11.2025 21:46 👍 0 🔁 0 💬 1 📌 0
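
A minimal sketch of validating that unified Plan → Code → Walkthrough → Report format, assuming each stage maps to a file or directory in the repo; these file names are hypothetical:

```python
# Hypothetical layout check for the unified format; names are illustrative.
from pathlib import Path

REQUIRED = ["plan.md", "code", "walkthrough.md", "report.md"]  # Plan → Code → Walkthrough → Report

def missing_pieces(repo: Path) -> list[str]:
    """Return the parts of the unified format the repo fails to provide."""
    return [piece for piece in REQUIRED if not (repo / piece).exists()]

# A complete repo can be handed to any evaluator in a fresh session;
# each missing piece is itself an actionable issue.
print(missing_pieces(Path("agent_run_01")) or "format complete")
```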

Research agents are getting smarter. They can write convincing PhD-level reports 🧑‍🔬

But has anyone checked whether the way they reach their results makes any sense?

Our framework, MechEvalAgent, verifies the science, not just the story 🤖

1/n🧵

20.11.2025 21:46 👍 3 🔁 0 💬 1 📌 1

✨ We thank @boknilev.bsky.social for his insightful suggestions!

20.11.2025 21:37 👍 0 🔁 0 💬 0 📌 0

🚧 What remains hard and what comes next:
We are actively working on the following issues:
- Better question design
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic for other fields.
6/n🧵

20.11.2025 21:37 👍 0 🔁 0 💬 1 📌 0

A failure example: The agent "validated" its circuit by checking whether the neurons it used happened to be on a list of names we provided.
5/n🧵

20.11.2025 21:37 👍 0 🔁 0 💬 1 📌 0

โ—๏ธWhat we found in our case studies:
We tested across three tasks: IOI replication, open-ended sarcasm circuit locating, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization
4/n๐Ÿงต

20.11.2025 21:37 👍 0 🔁 0 💬 1 📌 0