AI-assisted Reviewing is Necessary and Should be Open
Peer review is facing a death spiral, and AI production tools are speeding it up. AI-assisted reviewing is necessary and should be open. We built OpenAIReview: open AI reviewing for everyone, for the cost of a coffee.
openaireview.github.io/blog.html 🧵
09.03.2026 18:48
Democracy depends on an informed electorate. But political issues and ballot measures can be confusing, obscuring the effects of one outcome versus another. And politics is personal. Once we make an initial decision, it can be hard to see things from "the other side."
19.02.2026 21:46
Huge thanks to my collaborators Alexander Baumgartner and Haojia Sun, @ari-holtzman.bsky.social, and
@chenhaotan.bsky.social. We also thank Yonatan Belinkov, Tal Haklay, Austin Kozlowski, Jiachen Liu, and Tamar Rott Shaham for discussions. We are grateful to Modal for computing credits.
12/n
10.02.2026 20:12
GitHub - ChicagoHAI/MechEvalAgent
Our repo: github.com/ChicagoHAI/M...
Our paper: elena-baixy.github.io/TheStoryisNo...
(Unfortunately, we are on hold at @arxiv.bsky.social; we will update with the arXiv link later!)
11/n
10.02.2026 19:44
🔮 Toward the future
AI reviewers require caution since they can hallucinate and struggle with instruction following. But we argue for this change: Story-based review → Execution-grounded evaluation
The story is not the science. MechEvalAgent is a step toward that future.
10/n
10.02.2026 19:44
⚠️ Implications
Narrative-alone review is insufficient.
Passing MechEvalAgent doesn't guarantee "good" research, but failing it reveals actionable weaknesses. Our results demonstrate that AI agents can raise the floor of research evaluation by exposing weaknesses that humans often overlook.
9/n
10.02.2026 19:44
⏱️ Efficiency matters
Humans took ~2.2 hours per evaluation, while MechEvalAgent finishes in <30 min for agent repos and ~1 hr for human repos.
8/n
10.02.2026 19:44
🧩 Examples of hidden failures
MechEvalAgent uncovered issues like:
❌ Missing or broken files preventing reproduction.
❌ Strong conclusions drawn despite performance only slightly above baseline.
❌ Reported metrics deviating by >8% on rerun (see the sketch below).
❌ Experiments claimed but never implemented.
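For the metric-deviation item, here is a minimal sketch (our own illustration, not MechEvalAgent's actual code; the metric names are invented) of flagging reported numbers that drift on a rerun:

```python
# Minimal sketch: flag reported metrics that deviate from a rerun by more
# than a relative tolerance (8% here). Illustrative only; not the actual
# MechEvalAgent implementation, and the metric names below are invented.

def deviating_metrics(reported: dict, rerun: dict, rel_tol: float = 0.08) -> dict:
    """Return {metric: (reported_value, rerun_value)} for metrics that disagree."""
    flagged = {}
    for name, claimed in reported.items():
        observed = rerun.get(name)
        if observed is None:
            flagged[name] = (claimed, None)  # claimed metric was never reproduced
            continue
        if abs(observed - claimed) / max(abs(claimed), 1e-12) > rel_tol:
            flagged[name] = (claimed, observed)
    return flagged

if __name__ == "__main__":
    reported = {"logit_diff": 3.20, "faithfulness": 0.91}   # numbers from the report
    rerun = {"logit_diff": 2.85, "faithfulness": 0.90}      # numbers from a fresh run
    print(deviating_metrics(reported, rerun))
    # {'logit_diff': (3.2, 2.85)}  -> ~11% deviation, flagged for review
```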
7/n
10.02.2026 19:44
MechEvalAgent vs. Humans
MechEvalAgent:
- achieves 80%+ agreement with expert reviewers
- surfaces 51 additional issues humans missed.
Execution reveals problems a narrative-alone review simply cannot.
6/n
10.02.2026 19:44
Even in human repos, execution breakdowns are common. Our ablation study shows that without execution, those failures are hard to catch.
5.5/n
10.02.2026 19:44
🚨 Key finding: failures are everywhere
We evaluated 30 research outputs across replication tasks, open-ended research questions, and human-written repositories.
Across projects, 93% fail reproducibility and 80% fail coherence.
5/n
10.02.2026 19:44
🤖 Introducing MechEvalAgent
MechEvalAgent checks research on three dimensions:
✅ Coherence
✅ Reproducibility
✅ Generalizability
We focus on mech-interp as a testbed because its claims are often testable through rerunning interventions. We use Claude Code and Scribe as our backbone.
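To give a feel for the output, here is a rough sketch of the kind of per-dimension verdict such an evaluator could emit (the field names and the example issue are ours, not the paper's actual schema):

```python
# Rough sketch of a structured, per-dimension verdict. Field names and the
# example issue are illustrative; this is not MechEvalAgent's real schema.
from dataclasses import dataclass, field

@dataclass
class DimensionVerdict:
    passed: bool
    issues: list = field(default_factory=list)

@dataclass
class EvaluationReport:
    coherence: DimensionVerdict
    reproducibility: DimensionVerdict
    generalizability: DimensionVerdict

    def summary(self) -> str:
        parts = []
        for name in ("coherence", "reproducibility", "generalizability"):
            v = getattr(self, name)
            parts.append(f"{name}: {'pass' if v.passed else 'fail'} ({len(v.issues)} issues)")
        return "; ".join(parts)

report = EvaluationReport(
    coherence=DimensionVerdict(True),
    reproducibility=DimensionVerdict(False, ["reported metric deviates >8% on rerun"]),
    generalizability=DimensionVerdict(True),
)
print(report.summary())
# coherence: pass (0 issues); reproducibility: fail (1 issues); generalizability: pass (0 issues)
```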
4/n
10.02.2026 19:44
Execution-Grounded Evaluation Framework
We propose the first framework that evaluates research outputs as:
- Narrative (plan + report + human input prompts)
- Execution resources (code + data + walkthrough)
Together, they enable verification beyond narrative.
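One way to picture that pairing, as a sketch with invented field names (not the framework's actual interface):

```python
# Sketch only: invented field names for the narrative / execution split.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Narrative:
    plan: str            # what the authors set out to do
    report: str          # what they claim they found
    prompts: list        # human input prompts (if the author was an agent)

@dataclass
class ExecutionResources:
    code_dir: Path       # runnable code
    data_dir: Path       # data needed to rerun it
    walkthrough: str     # step-by-step instructions for reproducing results

@dataclass
class ResearchOutput:
    narrative: Narrative
    execution: ExecutionResources  # verification reruns this against the narrative
```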
3/n
10.02.2026 19:44
Why it matters
Science is facing a reproducibility crisis, and AI is accelerating research output faster than humans can evaluate it.
Paper-only review can't catch:
❌ Broken code
❌ Missing experiments
❌ Silent metric bugs
❌ Overclaimed conclusions
We need execution!
2/n
10.02.2026 19:44
📖 ≠ 🧪 The Story is Not the Science.
Code is submitted but rarely executed during peer review, an issue likely to worsen with research agents. 🧑‍🔬
We introduce MechEvalAgent, an execution-grounded evaluation of narrative + execution. Verify the science, not just the story.
1/n
10.02.2026 19:44
Why can't powerful AIs learn basic multiplication?
New research reveals why even state-of-the-art large language models stumble on seemingly easy tasks, and what it takes to fix it
Featured in UChicago News: @elenal3ai.bsky.social & @chenhaotan.bsky.social's research into why AI can write complex code but fails at 4-digit multiplication:
tinyurl.com/5ukvm7p7
13.01.2026 21:30
Will be at #NeurIPS2025 presenting "Concept Incongruence"!
🦄🦆 Curious about a unicorn duck? Stop by, get one, and chat with us!
We made a new demo that detects hidden conflicts in system prompts, spotting "concept incongruence" for safer prompts.
🔗: github.com/ChicagoHAI/d...
🗓️ Dec 3, 11 AM - 2 PM
24.11.2025 19:18
✨ We thank @boknilev.bsky.social for his insightful suggestions!
20.11.2025 21:46
MechEvalAgent: Grounded Evaluation of Research Agents in Mechanistic Interpretability | Notion
We want to make "AI doing science" something we can inspect and trust.
If you're excited about grounded evaluation and want to push this forward, check out our blog and repo; contributions are welcome.
Blog: tinyurl.com/MechEvalAgents
🧑‍💻 Repo: github.com/ChicagoHAI/M...
7/n 🧵
20.11.2025 21:46
What remains hard and what comes next:
- Better question design: It remains hard to automate questions that test generalization.
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic.
6/n 🧵
20.11.2025 21:46
A failure example: The agent "validated" its circuit by checking whether the neurons it used happened to be on a list of names we provided.
5/n 🧵
20.11.2025 21:46
What we found in our case studies:
We tested three tasks: IOI replication, open-ended sarcasm-circuit localization, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization
4/n 🧵
20.11.2025 21:46
2️⃣ A grounded evaluation pipeline:
- Coherence: Do the implementation, results, and claims line up? (see the sketch after this list)
- Reproducibility: Can a fresh session rerun the experiment and get the same results?
- Generalizability: Can the agent design questions that demonstrate real insight transfer?
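To make the coherence check concrete, a toy example (our own illustration; the claim format, metric name, and threshold are invented) of testing whether a quantitative claim is actually supported by the results the code produces:

```python
# Toy illustration of a coherence check: does a quantitative claim in the
# report line up with the results the code actually produced? The claim
# format, metric name, and threshold below are invented for this example.

def claim_supported(results: dict, metric: str, op: str, threshold: float) -> bool:
    value = results.get(metric)
    if value is None:
        return False  # claim cites a metric that was never computed
    return value >= threshold if op == ">=" else value <= threshold

results = {"accuracy_gain_over_baseline": 0.012}          # recomputed from the repo
claims = [("accuracy_gain_over_baseline", ">=", 0.05)]    # the report claims a strong gain

for metric, op, threshold in claims:
    ok = claim_supported(results, metric, op, threshold)
    print(f"{metric} {op} {threshold}: {'supported' if ok else 'NOT supported'}")
# accuracy_gain_over_baseline >= 0.05: NOT supported  -> flag as a possible overclaim
```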
3/n 🧵
20.11.2025 21:46
We have two components.
1️⃣ A unified research-output format:
To evaluate execution, we first unified agent outputs into a standard format:
Plan → Code → Walkthrough → Report.
This makes agents comparable and their reasoning traces inspectable.
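A small sketch of checking that a repo actually ships all four components (the file and directory names here are assumptions for illustration, not the project's required layout):

```python
# Sketch: check that a research repo contains the four unified-format
# components. The file/directory names are assumptions for illustration.
from pathlib import Path

EXPECTED = ["plan.md", "code", "walkthrough.md", "report.md"]

def missing_components(repo: Path) -> list:
    """Return the expected components that are absent from the repo."""
    return [name for name in EXPECTED if not (repo / name).exists()]

if __name__ == "__main__":
    gaps = missing_components(Path("example_repo"))
    print("conforms to format" if not gaps else f"missing: {', '.join(gaps)}")
```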
2/n 🧵
20.11.2025 21:46