Can AI simulate human behavior?
The promise is revolutionary for science & policy. But there's a huge "IF": Do these simulations actually reflect reality?
To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
28.10.2025 16:53
Looking forward to chatting about the limitations of AI annotators/LLM-as-a-Judge, opportunities for improving them, evaluating AI personality/character, and the future of evals more broadly!
27.07.2025 15:22
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two…
I'll be at #ACL2025 presenting research from my Apple internship! Our poster is titled: "Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?"
Let's meet: come by our poster on Tuesday (29/7), 10:30-12:00, Hall 4/5, or DM me to set up a meeting!
Paper link below
27.07.2025 15:22
Excited to be in Singapore for ICLR! Keen to chat about interpreting feedback data and detecting model characteristics
Reach out or come by our poster on Inverse Constitutional AI on Friday 25 April from 10am-12.30pm (#520 in Hall 2B) - @timokauf.bsky.social and I will be there!
24.04.2025 15:47
If you want to understand your own model and data better, try Feedback Forensics!
Install it from GitHub: github.com/rdnfn/feedba...
View interactive results: app.feedbackforensics.com?data=arena_s...
17.04.2025 13:55
Conclusion: The differences between the arena and the public version of Llama 4 Maverick highlight the importance of having a detailed understanding of preference data beyond single aggregate numbers or rankings! (Feedback Forensics can help!)
17.04.2025 13:55
Bonus 2: Humans like the arena model's behaviours
Human annotators on Chatbot Arena indeed like the change in tone, more verbose responses and adapted formatting.
17.04.2025 13:55
Bonus 1: Things that stayed consistent
I also find that some behaviours stayed the same: on the Arena dataset prompts, the public and arena model versions are similarly very unlikely to suggest illegal activities, be offensive or use inappropriate language.
17.04.2025 13:55
Feedback Forensics App
Further differences: clearer reasoning, more references, …
There are quite a few other differences between the two models beyond the three categories already mentioned. See the interactive online results for a full list: app.feedbackforensics.com?data=arena_s...
17.04.2025 13:55
Third: Formatting - a lot of it!
The arena model uses more bold, italics, numbered lists and emojis relative to its public version.
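As a toy illustration of how such formatting differences could be counted (a heuristic sketch with hypothetical regexes; the actual Feedback Forensics analysis uses LLM-based annotation, not pattern matching):

```python
import re

def formatting_stats(text: str) -> dict:
    """Count simple markdown formatting features in a response.

    Heuristic illustration only; not how the tool actually
    measures formatting.
    """
    return {
        "bold": len(re.findall(r"\*\*[^*]+\*\*", text)),
        "numbered_items": len(re.findall(r"^\d+\.\s", text, re.MULTILINE)),
    }

stats = formatting_stats("**Great question!**\n1. First point\n2. Second point")
print(stats)  # {'bold': 1, 'numbered_items': 2}
```

Running a counter like this over both models' responses would let you compare average formatting usage per prompt.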
17.04.2025 13:55
Second: Tone - friendlier, more enthusiastic, more humour …
Next, the results highlight how much friendlier, more emotional, enthusiastic, humorous, confident and casual the arena model is relative to its own public-weights version (and also its opponent models).
17.04.2025 13:55
So how exactly is the arena version different to the public Llama 4 Maverick model? I make a few observations…
First and most obvious: Responses are more verbose. The arena model's responses are longer relative to the public version for 99% of prompts.
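A minimal sketch of how such a verbosity comparison could be computed, using made-up response strings rather than the real Arena data:

```python
# Made-up paired responses; in the real analysis these would come from
# the released Arena dataset and the public-weights model.
arena_responses = [
    "a long, detailed, enthusiastic answer with extra formatting",
    "another fairly verbose reply",
]
public_responses = ["a short answer", "brief"]

# Share of prompts where the arena response is longer than the public one.
longer = sum(len(a) > len(b) for a, b in zip(arena_responses, public_responses))
share_longer = longer / len(arena_responses)
print(share_longer)  # 1.0 for this toy data
```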
17.04.2025 13:55
Note on interpreting metrics: values above 0 mean the characteristic is more present in the arena model's responses than in the public model's. See linked post for details.
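As a toy illustration of that reading (made-up scores, not actual Feedback Forensics output):

```python
# Hypothetical per-characteristic scores comparing arena vs public model.
# Positive = characteristic more present in the arena model's responses.
scores = {"verbose": 0.62, "uses_emoji": 0.41, "polite": -0.08}

more_in_arena = [name for name, value in scores.items() if value > 0]
print(more_in_arena)  # ['verbose', 'uses_emoji']
```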
17.04.2025 13:55
Setup: I use the original Arena dataset of Llama-4-Maverick experimental generations, kindly released openly by @lmarena. I compare the arena model's responses to those generated by its public-weights version (via Lambda and OpenRouter).
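The setup described above could be sketched roughly as follows; the dataset record and the `generate_public_response` helper are hypothetical stand-ins for the released Arena data and an inference-provider API call:

```python
# Stand-in for one record from the released Arena dataset.
arena_data = [
    {"prompt": "Explain tides.", "response": "Tides arise because..."},
]

def generate_public_response(prompt: str) -> str:
    """Placeholder for querying the public-weights model via an
    inference provider; returns a dummy string here."""
    return f"[public model answer to: {prompt}]"

# Pair each arena response with a fresh public-weights response,
# ready to hand to a pairwise annotation/analysis tool.
pairs = [
    {
        "prompt": d["prompt"],
        "response_arena": d["response"],
        "response_public": generate_public_response(d["prompt"]),
    }
    for d in arena_data
]
```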
17.04.2025 13:55
Background: Llama 4 Maverick was released earlier this month. Beforehand, a separate experimental Arena version was evaluated on Chatbot Arena (Llama-4-Maverick-03-26-Experimental). Some have reported that these two models appear to be quite different.
17.04.2025 13:55
How exactly was the initial Chatbot Arena version of Llama 4 Maverick different from the public HuggingFace version?
I used our Feedback Forensics app to quantitatively analyse how exactly these two models differ. An overview…
17.04.2025 13:55
Feedback Forensics is just getting started with this Alpha release, with lots of exciting features and experiments on the roadmap. Let me know what other datasets we should analyze or which features you would like to see!
17.03.2025 18:12
GitHub - rdnfn/feedback-forensics: A tool to investigate pairwise feedback: understand and find issues in your data
Big thanks also to my collaborators on Feedback Forensics and the related Inverse Constitutional AI (ICAI) pipeline: Timo Kaufmann, Eyke Hüllermeier, @samuelalbanie.bsky.social, Rob Mullins!
Code: github.com/rdnfn/feedback-forensics
Note: usual limitations for LLM-as-a-Judge-based systems apply.
17.03.2025 18:12
... harmless/helpful data by @anthropic.com, and finally the recent OLMo 2 preference mix by @ljvmiranda.bsky.social, @natolambert.bsky.social et al., see all results at app.feedbackforensics.com.
17.03.2025 18:12
We analyze several popular feedback datasets: Chatbot Arena data with topic labels from the Arena Explorer pipeline, PRISM data by @hannahrosekirk.bsky.social et al., AlpacaEval annotations, ...
17.03.2025 18:12
3. Discovering model strengths
How is GPT-4o different to other models? → Uses more numbered lists, but Gemini is more friendly and polite
app.feedbackforensics.com?data=chatbot...
17.03.2025 18:12
2. Finding preference differences between task domains
How do preferences differ across writing tasks? → Emails should be concise, creative writing more verbose
app.feedbackforensics.com?data=chatbot...
17.03.2025 18:12
1. Visualizing dataset differences
How does Chatbot Arena differ from Anthropic Helpful data? → Prefers less polite but better formatted responses
app.feedbackforensics.com?data=chatbot...
17.03.2025 18:12
Introducing Feedback Forensics: a new tool to investigate pairwise preference data.
Feedback data is notoriously difficult to interpret and has many known issues; our app aims to help!
Try it at app.feedbackforensics.com
Three example use-cases below.
17.03.2025 18:12