We discovered that language models leave a natural "signature" on their API outputs that's extremely hard to fake. Here's how it works:
arxiv.org/abs/2510.14086 1/
At @colmweb.org all week 🥯! Presenting 3 mechinterp + actionable interp papers at @interplay-workshop.bsky.social
1. BERTology in the Modern World w/ @bearseascape.bsky.social
2. MICE for CATs
3. LLM Microscope w/ Jiarui Liu, Jivitesh Jain, @monadiab77.bsky.social
Reach out to chat! #COLM2025
Excited to be attending NEMI in Boston today to present 🐭 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools and to co-moderate the model steering and control roundtable! Come find me to connect and chat about steering and actionable interp
At #ACL2025 in Vienna 🇦🇹 till next Saturday! Love to chat about anything #interpretability 🔍, understanding model internals 🔬, and finding yummy vegan food 🥬
At #ICML2025 🇨🇦 till Sunday! Love to chat about #interpretability, understanding model internals, and finding yummy vegan food in Vancouver 🥬
Congrats 🥳🥳🥳🥳
🚨 New #interpretability paper with @nsubramani23.bsky.social: 🕵️ Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models
🚨 Check out our new #interpretability paper: 🕵🏽 Model Internal Sleuthing, led by the amazing @bearseascape.bsky.social, an undergrad at @scsatcmu.bsky.social @ltiatcmu.bsky.social
Excited to announce that I started at @googleresearch.bsky.social on the cloud team as a student researcher last month, working with Hamid Palangi on actionable #interpretability 🔍 to build better tool-using #agents ⚙️🤖
Presenting this today at the poster session at #NAACL2025!
Come chat about interpretability, trustworthiness, and tool-using agents!
🗓️ - Thursday May 1st (today)
📍 - Hall 3
🕑 - 2:00-3:30pm
At #NAACL2025 🌵 till Sunday! Love to chat about interpretability, understanding model internals, and finding vegan food 🥬
Come to our poster in Albuquerque on Thursday, 2:00-3:30pm, in the interpretability & analysis section!
Paper: aclanthology.org/2025.naacl-l...
Code (coming soon): github.com/microsoft/mi...
🧵/🧵
MICE 🐭:
🎯 - significantly beats baselines on expected tool-calling utility, especially in high-risk scenarios
✅ - matches expected calibration error of baselines
✅ - is sample efficient
✅ - generalizes zero-shot to unseen tools
5/🧵
Calibration is not sufficient: both an oracle and a model that just predicts the base rate are perfectly calibrated 🤦🏽
We develop a new metric, expected tool-calling utility 🛠️, to measure the utility of deciding whether or not to execute a tool call via a confidence score!
4/🧵
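The thread doesn't spell out the metric's exact definition, but one simple way to operationalize the idea is a threshold policy: execute a call only when confidence clears a bar, reward correct executions, penalize incorrect ones heavily (the high-risk setting), and treat skipped calls as neutral. A hypothetical sketch, with all utility values and the function name made up for illustration:

```python
def expected_tool_utility(confidences, correct, threshold=0.8,
                          u_good=1.0, u_bad=-5.0, u_skip=0.0):
    """Average utility of executing a tool call only when confidence >= threshold.

    Illustrative formulation (not necessarily the paper's): a correct execution
    earns u_good, an incorrect execution costs u_bad, and a skipped call gets
    u_skip. A well-calibrated confidence score lets the agent skip exactly the
    risky calls.
    """
    total = 0.0
    for conf, ok in zip(confidences, correct):
        if conf >= threshold:
            total += u_good if ok else u_bad
        else:
            total += u_skip
    return total / len(confidences)

# One confident-and-right call, one confident-but-wrong call, one skipped call.
print(expected_tool_utility([0.9, 0.95, 0.2], [True, False, True]))
```

Note how a single overconfident wrong call dominates the average here; a metric like this rewards knowing when *not* to act, which plain calibration error does not capture.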
We propose 🐭 MICE to better assess confidence when calling tools:
1️⃣ decode from each intermediate layer of an LM
2️⃣ compute similarity scores between each layer's generation and the final output
3️⃣ train a probabilistic classifier on these features
3/🧵
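A minimal numpy sketch of those three steps, with made-up data and names (not the paper's implementation): in the real pipeline the per-layer features would come from decoding hidden states at every layer with the unembedding matrix and scoring each decoded string against the final tool call; here we simulate those similarity scores directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_calls = 12, 200

# Stand-in for steps 1-2: similarity in [0, 1] between each intermediate
# layer's decoded generation and the model's final tool call (simulated here).
layer_sims = rng.uniform(0.0, 1.0, size=(n_calls, n_layers))

# Hypothetical labels: 1.0 if the emitted tool call was actually correct.
# We simulate that agreement in the last few layers predicts correctness.
labels = (layer_sims[:, -4:].mean(axis=1)
          + 0.3 * rng.normal(size=n_calls) > 0.5).astype(float)

# Step 3: a probabilistic classifier (plain logistic regression, trained by
# gradient descent) mapping per-layer similarities to a confidence score.
X = np.hstack([layer_sims, np.ones((n_calls, 1))])   # add a bias column
w = np.zeros(n_layers + 1)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X[:100] @ w))            # train on first 100 calls
    w -= 0.1 * X[:100].T @ (p - labels[:100]) / 100   # logistic-loss gradient

confidences = 1.0 / (1.0 + np.exp(-X[100:] @ w))      # scores in (0, 1)
print(confidences[:5])
```

The resulting scores can then drive the execute-or-abstain decision measured by expected tool-calling utility.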
1️⃣ Tool-using agents need to be useful and safe as they take actions in the world
2️⃣ Language models are poorly calibrated
🤔 Can we use model internals to better calibrate language models to make tool-using agents safer and more useful?
2/🧵
Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools, appearing at #NAACL2025
This was work done at @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson
1/🧵
Congrats!!
Congrats! 🥳
Have these people met … society? Read a book? Listened to music? Regurgitating esoteric facts isn't intelligence.
This is more like humanity's last stand at Jeopardy
www.nytimes.com/2025/01/23/t...
👍🏽 looks good to me!
👋🏽 Intro
💼 PhD student @ltiatcmu.bsky.social
🔍 My research is in model interpretability: understanding the internals of LLMs to build more controllable and trustworthy systems
🫵🏽 If you're interested in better understanding language technology or model interpretability, let's connect!
👋🏽
👋🏽
1) I'm working on using intermediate model generations from LLMs to better calibrate tool-using agents ⚙️🤖 than the raw probabilities themselves! Turns out you can 🥳
2) There's gotta be a nice geometric understanding of what's going on within LLMs when we tune them 🤔
Love to be added too!
Utah is hiring tenure-track/tenured faculty & a priority area is NLP!
Please reach out over email if you have questions about the school and Salt Lake City; happy to share my experience so far.
utah.peopleadmin.com/postings/154...