AI-assisted Reviewing is Necessary and Should be Open
Peer review is facing a death spiral, and AI production tools are speeding it up. AI-assisted reviewing is necessary and should be open. We built OpenAIReview: open AI reviewing for everyone, for the cost of a coffee.
openaireview.github.io/blog.html 🧵
09.03.2026 18:48
Democracy depends on an informed electorate. But political issues and ballot measures can be confusing, obscuring the effects of one outcome versus another. And politics is personal. Once we make an initial decision, it can be hard to see things from "the other side."
19.02.2026 21:46
Huge thanks to my collaborators Alexander Baumgartner and Haojia Sun, @ari-holtzman.bsky.social, and
@chenhaotan.bsky.social. We also thank Yonatan Belinkov, Tal Haklay, Austin Kozlowski, Jiachen Liu, and Tamar Rott Shaham for discussions. We are grateful to Modal for computing credits.
12/n
10.02.2026 20:12
GitHub - ChicagoHAI/MechEvalAgent
Our repo: github.com/ChicagoHAI/M...
Our paper: elena-baixy.github.io/TheStoryisNo...
(Unfortunately, we are on hold at @arxiv.bsky.social; we will update with the arXiv link later!)
11/n
10.02.2026 19:44
🔮 Toward the future
AI reviewers require caution since they can hallucinate and struggle with instruction following. But we argue for this change: Story-based review → Execution-grounded evaluation
The story is not the science. MechEvalAgent is a step toward that future.
10/n
10.02.2026 19:44
⚠️ Implications
Narrative-alone review is insufficient.
Passing MechEvalAgent doesn't guarantee "good" research, but failing it reveals actionable weaknesses. Our results demonstrate that AI agents can raise the floor of research evaluation by exposing weaknesses that humans often overlook.
9/n
10.02.2026 19:44
⏱️ Efficiency matters
Humans took ~2.2 hours per evaluation, while MechEvalAgent finishes in <30 min for agent repos and ~1 hr for human repos.
8/n
10.02.2026 19:44
🧩 Examples of hidden failures
MechEvalAgent uncovered issues like:
❌ Missing or broken files preventing reproduction.
❌ Strong conclusions drawn despite performance only slightly above baseline.
❌ Reported metrics deviating by >8% on rerun (see the sketch below).
❌ Experiments claimed but never implemented.
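For the metric-deviation item, here is a minimal sketch (our own illustration, not MechEvalAgent's actual code; the metric names are invented) of flagging reported numbers that drift on a rerun:

```python
# Minimal sketch: flag reported metrics that deviate from a rerun by more
# than a relative tolerance (8% here). Illustrative only; not the actual
# MechEvalAgent implementation, and the metric names below are invented.

def deviating_metrics(reported: dict, rerun: dict, rel_tol: float = 0.08) -> dict:
    """Return {metric: (reported_value, rerun_value)} for metrics that disagree."""
    flagged = {}
    for name, claimed in reported.items():
        observed = rerun.get(name)
        if observed is None:
            flagged[name] = (claimed, None)  # claimed metric was never reproduced
            continue
        if abs(observed - claimed) / max(abs(claimed), 1e-12) > rel_tol:
            flagged[name] = (claimed, observed)
    return flagged

if __name__ == "__main__":
    reported = {"logit_diff": 3.20, "faithfulness": 0.91}   # numbers from the report
    rerun = {"logit_diff": 2.85, "faithfulness": 0.90}      # numbers from a fresh run
    print(deviating_metrics(reported, rerun))
    # {'logit_diff': (3.2, 2.85)}  -> ~11% deviation, flagged for review
```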
7/n
10.02.2026 19:44
MechEvalAgent vs. Humans
MechEvalAgent:
- achieves 80%+ agreement with expert reviewers
- surfaces 51 additional issues humans missed.
Execution reveals problems a narrative-alone review simply cannot.
6/n
10.02.2026 19:44
Even in human repos, execution breakdowns are common. Our ablation study shows that without execution, those failures are hard to catch.
5.5/n
10.02.2026 19:44
🚨 Key finding: failures are everywhere
We evaluated 30 research outputs across replication tasks, open-ended research questions, and human-written repositories.
Across projects, 93% fail reproducibility and 80% fail coherence.
5/n
10.02.2026 19:44
🤖 Introducing MechEvalAgent
MechEvalAgent checks research on three dimensions:
✅ Coherence
✅ Reproducibility
✅ Generalizability
We focus on mech-interp as a testbed because its claims are often testable through rerunning interventions. We use Claude Code and Scribe as our backbone.
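To give a feel for the output, here is a rough sketch of the kind of per-dimension verdict such an evaluator could emit (the field names and the example issue are ours, not the paper's actual schema):

```python
# Rough sketch of a structured, per-dimension verdict. Field names and the
# example issue are illustrative; this is not MechEvalAgent's real schema.
from dataclasses import dataclass, field

@dataclass
class DimensionVerdict:
    passed: bool
    issues: list = field(default_factory=list)

@dataclass
class EvaluationReport:
    coherence: DimensionVerdict
    reproducibility: DimensionVerdict
    generalizability: DimensionVerdict

    def summary(self) -> str:
        parts = []
        for name in ("coherence", "reproducibility", "generalizability"):
            v = getattr(self, name)
            parts.append(f"{name}: {'pass' if v.passed else 'fail'} ({len(v.issues)} issues)")
        return "; ".join(parts)

report = EvaluationReport(
    coherence=DimensionVerdict(True),
    reproducibility=DimensionVerdict(False, ["reported metric deviates >8% on rerun"]),
    generalizability=DimensionVerdict(True),
)
print(report.summary())
# coherence: pass (0 issues); reproducibility: fail (1 issues); generalizability: pass (0 issues)
```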
4/n
10.02.2026 19:44
Execution-Grounded Evaluation Framework
We propose the first framework that evaluates research outputs as:
- Narrative (plan + report + human input prompts)
- Execution resources (code + data + walkthrough)
Together, they enable verification beyond narrative.
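One way to picture that pairing, as a sketch with invented field names (not the framework's actual interface):

```python
# Sketch only: invented field names for the narrative / execution split.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Narrative:
    plan: str            # what the authors set out to do
    report: str          # what they claim they found
    prompts: list        # human input prompts (if the author was an agent)

@dataclass
class ExecutionResources:
    code_dir: Path       # runnable code
    data_dir: Path       # data needed to rerun it
    walkthrough: str     # step-by-step instructions for reproducing results

@dataclass
class ResearchOutput:
    narrative: Narrative
    execution: ExecutionResources  # verification reruns this against the narrative
```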
3/n
10.02.2026 19:44
Why it matters
Science is facing a reproducibility crisis, and AI is accelerating research output faster than humans can evaluate it.
Paper-only review can't catch:
❌ Broken code
❌ Missing experiments
❌ Silent metric bugs
❌ Overclaimed conclusions
We need execution!
2/n
10.02.2026 19:44
📖 ≠ 🧪 The Story is Not the Science.
Code is submitted but rarely executed during peer review, an issue likely to worsen with research agents. 🧑‍🔬
We introduce MechEvalAgent, an execution-grounded evaluation of narrative + execution. Verify the science, not just the story.
1/n
10.02.2026 19:44
Why can't powerful AIs learn basic multiplication?
New research reveals why even state-of-the-art large language models stumble on seemingly easy tasks, and what it takes to fix it
Featured in UChicago News: @elenal3ai.bsky.social & @chenhaotan.bsky.social's research into why AI can write complex code but fails at 4-digit multiplication:
tinyurl.com/5ukvm7p7
13.01.2026 21:30
Will be at #NeurIPS2025 presenting "Concept Incongruence"!
🦄🦆 Curious about a unicorn duck? Stop by, get one, and chat with us!
We made a new demo that detects hidden conflicts in system prompts, spotting "concept incongruence" for safer prompts.
🔗: github.com/ChicagoHAI/d...
🗓️ Dec 3, 11 AM - 2 PM
24.11.2025 19:18
✨ We thank @boknilev.bsky.social for his insightful suggestions!
20.11.2025 21:46
MechEvalAgent: Grounded Evaluation of Research Agents in Mechanistic Interpretability | Notion
We want to make "AI doing science" something we can inspect and trust.
If you're excited about grounded evaluation and want to push this forward, check out our blog and repo; contributions are welcome.
Blog: tinyurl.com/MechEvalAgents
🧑‍💻 Repo: github.com/ChicagoHAI/M...
7/n 🧵
20.11.2025 21:46
What remains hard and what comes next:
- Better question design: It remains hard to automate questions that test generalization.
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic.
6/n 🧵
20.11.2025 21:46
A failure example: The agent "validated" its circuit by checking whether the neurons it used happened to be on a list of names we provided.
5/n 🧵
20.11.2025 21:46
What we found in our case studies:
We tested three tasks: IOI replication, open-ended sarcasm-circuit localization, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization
4/n 🧵
20.11.2025 21:46
2️⃣ A grounded evaluation pipeline:
- Coherence: Do the implementation, results, and claims line up? (see the sketch after this list)
- Reproducibility: Can a fresh session rerun the experiment and get the same results?
- Generalizability: Can the agent design questions that demonstrate real insight transfer?
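To make the coherence check concrete, a toy example (our own illustration; the claim format, metric name, and threshold are invented) of testing whether a quantitative claim is actually supported by the results the code produces:

```python
# Toy illustration of a coherence check: does a quantitative claim in the
# report line up with the results the code actually produced? The claim
# format, metric name, and threshold below are invented for this example.

def claim_supported(results: dict, metric: str, op: str, threshold: float) -> bool:
    value = results.get(metric)
    if value is None:
        return False  # claim cites a metric that was never computed
    return value >= threshold if op == ">=" else value <= threshold

results = {"accuracy_gain_over_baseline": 0.012}          # recomputed from the repo
claims = [("accuracy_gain_over_baseline", ">=", 0.05)]    # the report claims a strong gain

for metric, op, threshold in claims:
    ok = claim_supported(results, metric, op, threshold)
    print(f"{metric} {op} {threshold}: {'supported' if ok else 'NOT supported'}")
# accuracy_gain_over_baseline >= 0.05: NOT supported  -> flag as a possible overclaim
```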
3/n 🧵
20.11.2025 21:46
We have two components.
1️⃣ A unified research-output format:
To evaluate execution, we first unified agent outputs into a standard format:
Plan → Code → Walkthrough → Report.
This makes agents comparable and their reasoning traces inspectable.
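A small sketch of checking that a repo actually ships all four components (the file and directory names here are assumptions for illustration, not the project's required layout):

```python
# Sketch: check that a research repo contains the four unified-format
# components. The file/directory names are assumptions for illustration.
from pathlib import Path

EXPECTED = ["plan.md", "code", "walkthrough.md", "report.md"]

def missing_components(repo: Path) -> list:
    """Return the expected components that are absent from the repo."""
    return [name for name in EXPECTED if not (repo / name).exists()]

if __name__ == "__main__":
    gaps = missing_components(Path("example_repo"))
    print("conforms to format" if not gaps else f"missing: {', '.join(gaps)}")
```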
2/n 🧵
20.11.2025 21:46