We ran a controlled study of 125 verified humans vs 5 AI agents. Can agents reliably be detected?
Here's what we found:
www.prolific.com/resources/au...
Beta is currently available in Qualtrics. We're actively scoping integrations with additional platforms, and the tech is generalisable. If helpful, I'd be happy to connect you with someone from Prolific to discuss your feedback.
Frontiers episode 1:
Jerome Wynne from @Prolific in conversation with Crystal Qian, from Google DeepMind, talking about Deliberate Lab: a platform for running online research experiments on human + LLM group dynamics.
www.youtube.com/watch?v=5vyi...
AI agents are becoming a serious threat to research data quality.
Today we're rolling out Bot authenticity checks on @joinprolific.bsky.social, detecting agentic AI with 100% accuracy in testing.
Comes with a native Qualtrics integration! More info:
www.prolific.com/resources/in...
Fresh HUMAINE results are here.
Gemini 3 is still first, but Mistral Large 3 and DeepSeek v3.2 are making things interesting.
Opus 4.5 didn't dominate, but Anthropic is likely prioritizing complex reasoning/coding over the conversational fluency that this benchmark favors.
prolific.com/humaine
Lots of chatter about this paper currently. It's a stark warning, but at present I see it as a warning of what might come, not what is happening now. As a research community we need to see it as a call to arms to develop new strategies, NOT a call to abandon online sampling. Reasoning below
All fair. Expected to see more diversity in modalities also. Qual studies (which can now be done at scale) are likely to be more robust than survey only.
Studies aren't distributed on a first-come first-served basis, but it's a useful theory. I will share with the team.
Right. It is generalisable (JS plugin), though we don't have a native integration with otree yet.
Will reach out to the authors to see if we can understand more details & see if we can add Authenticity Check as a mitigation option.
45% of participants copying OR pasting ~= 45% LLM use.
Only a single-digit percentage of responses seem to fail their honeypot and other mitigations, which is closer to our internal prevalence measures.
There are many reasons to copy/paste while still being a conscientious human.
"Even to an untrained eye, some of these responses were obviously generated by LLMs", but the % doesn't seem reported?
If I'm reading the paper correctly, their prevalence detection was "we only tracked copying and pasting on a page containing an open-ended question". This is a fairly crude measure of LLM detection, and an upper bound rather than an accurate prevalence estimate.
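To make the critique concrete, here is a minimal sketch of what copy/paste-based detection amounts to. This is not the paper's or Prolific's actual implementation; the event names and the `pasteRate` helper are illustrative. Counting anyone who pasted at all as an "LLM user" is exactly why the figure is an upper bound.

```javascript
// Illustrative sketch only: track copy/paste events on a page with an
// open-ended question, then compute the fraction of respondents who
// pasted at least once. A paste is NOT proof of LLM use.
function recordEvent(events, type) {
  events.push({ type, at: Date.now() });
}

// In a browser you would wire these up to the answer textarea, e.g.:
// textarea.addEventListener('paste', () => recordEvent(events, 'paste'));
// textarea.addEventListener('copy',  () => recordEvent(events, 'copy'));

// Crude "prevalence" estimate: share of respondents with any paste event.
// This is the upper bound the thread criticises, not a true LLM-use rate.
function pasteRate(respondents) {
  const pasted = respondents.filter(r =>
    r.events.some(e => e.type === 'paste')
  ).length;
  return pasted / respondents.length;
}
```

A conscientious human who pastes a quote, a URL, or their own draft from a notes app is indistinguishable from an LLM user under this measure, which is the nuance the thread argues for.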
LLM use by real humans is a slightly different threat to the scaled agent threat discussed in the paper though, and I think requires a bit more nuance in its response.
I hadn't, thanks for sharing. Agree with many of the mitigation strategies, though given the data was collected on Prolific we would have recommended our built-in tool.
researcher-help.prolific.com/en/articles/...
prolific.com/resources/pr...
If you want to work on these problems, or collaborate on research in this area, get in touch. Much more to come in this space!
Without minimising the seriousness of the threat raised in this paper, I'm more optimistic. This is just the latest challenge to the integrity of online research.
We've been proactively adding to our suite of authenticity tools - more every week - including many of Sean's recommendations:
We also do spot checks to protect against account reselling: participant-help.prolific.com/en/articles/...
Not sure I agree: these are tractable challenges and we are working on them.
bsky.app/profile/phe-...
Tara, let me know if I can share any more info that would put your mind at ease. We have implemented numerous mitigations to ensure the collection of authentic data (and continue to assess and invest further).
www.prolific.com/resources/in...
There's a lot of work to do here, though, and it's likely to continue being a bit of an arms race. Our bet is on many layers of assessment here, but we are keen to collaborate with researchers & the broader community to address this (very real) threat.
Agreed that this is the harder challenge, though we do have tooling for this also (in addition to throttling). Works with Qualtrics.
& we're continuing to invest here. I'd be happy to work with you or Sean on replication if of interest.
Some notes on our continued identity verification and agent detection here:
participant-help.prolific.com/en/articles/...
www.prolific.com/resources/in...
I disagree. AFAIK, this was not run on Prolific, and we already have the majority of Sean's recommendations in place.
Ongoing Panelist Validation ✅
Throttling Mechanisms ✅
Panelist Professionalism ✅
Panelist Quality Checks ✅
Location Checks ✅
Identity Validation ✅
Secure Software ⚠️ (partial)
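The thread names throttling as one mitigation without detail. As a hypothetical sketch (not Prolific's actual design), a token-bucket limiter is one common way to cap how fast a single account can start studies; `capacity` and `refillPerSec` here are invented parameters for illustration.

```javascript
// Hypothetical throttling sketch (token bucket), NOT Prolific's design:
// each account can start at most `capacity` studies in a burst, with
// allowance refilling at `refillPerSec` tokens per second.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;      // max burst size
    this.tokens = capacity;        // start full
    this.refillPerSec = refillPerSec;
    this.last = Date.now();
  }

  // Returns true if the action is allowed, consuming one token.
  tryTake(now = Date.now()) {
    const elapsed = (now - this.last) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillPerSec
    );
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The design intuition: a human participant's study rate is bursty but bounded, while a scaled agent fleet wants sustained high throughput, so even a simple per-account rate cap raises the cost of the attack.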
This is a super important paper, but honestly, I think the opposite. The integrity and authenticity of online panels have been in question since before this risk from agents. IMO this forces all providers to up their verification and the transparency of their participants. E.g.
www.prolific.com/resources/pr...