Thank you to the Frontier Model Forum, Sentinel Bio, and @packardfdn.bsky.social for supporting our work and to our advisory board.
Shout out to Shen Zhou Hong, @alex-kleinman.bsky.social, Alyssa Mathiowetz, @adamhowes.bsky.social, @xrg.bsky.social, @lucarighetti.bsky.social, Joe Torres, Julian Cohen, Suveer Ganta, Deepika Pahari, Alex Letizia
You can read more here:
Blog post: activesite.substack.com/p/rct
arXiv Preprint: arxiv.org/abs/2602.16703
🔮 Predictions from @research-fri.bsky.social: forecastingresearch.substack.com/p/how-well-...
We're actively hiring for scientists and operators!
We especially want to find a Head of Ops to help build an engine to repeat this study regularly and develop entirely new ones.
jobs.ashbyhq.com/activesite
Importantly: this is a snapshot of mid-2025 novice and LLM performance.
Results could change as new LLMs become more capable, easier to use in the lab, and as average elicitation skill improves.
As models evolve, we aim to continue tracking how people use frontier AI in biology.
How good were participants at using LLMs?
~40% of participants never uploaded images to LLMs.
Interestingly, both arms mentioned YouTube most often as helpful.
How reliable were LLMs in the hands of novices?
LLM transcripts revealed that models can still make mistakes, especially in molecular cloning.
LLMs led participants to move quicker (Panel A) but often not with the correct materials (Panel B).
It's hard to compress all that into a single statistic.
But one way is by using a Bayesian model, which suggests LLMs give a ~1.4x boost on a "typical" wet-lab task.
Fundamentally, we're confident that there wasn't a large LLM slow-down or speed-up (95% CrI: 0.7x to 2.6x).
But there are some signs LLMs were useful.
LLM participants had higher success on 4 out of 5 tasks, most notably in cell culture (69% vs. 55%; P = 0.06).
LLM participants also advanced further within a task even if they didn't finish within the study period (odds >80%).
Our primary outcome: were LLM users more likely to complete all three of the core tasks *together*?
Only ~5% of the LLM arm and ~7% of the Internet arm completed all three.
No significant difference, and far lower than experts predicted.
The study was the largest and longest of its kind: 153 participants with minimal lab experience over 8 weeks, randomized to LLM and Internet-only.
They tried 5 laboratory tasks, 3 of which are central to a viral reverse genetics workflow. No protocols given, just an objective.
We ran a randomized controlled trial to see if LLMs can help novices perform molecular biology in a wet-lab.
The results: LLMs may help in some aspects, but we found no significant increase in end-to-end completion of the core tasks. That's lower than what experts predicted.
Our findings 🧵