Brooke Vlahos, Peter Clark, Doug Downey, @yoavgo.bsky.social Ashish Sabharwal, Daniel S. Weld
Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, @guywiener.bsky.social, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka...
Many thanks to my @ai2.bsky.social teammates: Mike D'Arcy, @nbalepur.bsky.social, Dan Bareket, Bhavana Dalvi, @sergeyf.bsky.social, Dany Haddad, Jena D. Hwang, @peterjansen-ai.bsky.social, Varsha Kishore, Bodhisattwa Majumder, @arnaik19.bsky.social, Sigal Rahamimov, Kyle Richardson...
We tested 22 agent classes, more *kinds* of agents than other benchmarks.
AgentBaselines makes them reusable, incl. our SOTA science agents: github.com/allenai/agent-baselines
Blog: allenai.org/blog/astabench
Paper: arxiv.org/abs/2510.21652
Leaderboard: huggingface.co/spaces/allenai/asta-bench-leaderboard
🛠️ AstaBench is the first to provide reproducible (date-limited) large-scale search tools, plus a full scientific research environment for agents.
Our leaderboard highlights agents that use these tools, enabling more controlled measurement of *AI*. (We measure LLM costs too.)
[Image: AstaBench with abstract measurement icons]
Agent benchmarks don't measure true *AI* advances.
We built one that's hard & trustworthy:
• AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems
• SOTA results across 22 agent *classes*
• AgentBaselines agents suite
• arxiv.org/abs/2510.21652
🧵👇
@kylelo.bsky.social your gifs are an unapproved manipulation of my human attention