A single color in a prompt can change an LLM's prediction.
As Hila Gonen notes:
Likes yellow → school bus driver
Likes red → firefighter
Seen similar prompt sensitivity in LLMs?
#WiAIR_podcast 🎙️: youtu.be/Lsq3UzM8wIg
Happy International Women's Day, and happy birthday to #WiAIR!
#wiair_podcast
By reusing embeddings already produced during generation, OMNIGUARD is ≈120× faster than the fastest baseline in their evaluation (a rough sketch of this reuse follows the links below).
🎬YouTube: www.youtube.com/watch?v=Lsq3...
🎙️Spotify: open.spotify.com/show/51RJNlZ...
🍎Apple Podcasts: podcasts.apple.com/ca/podcast/w...
📄Paper: arxiv.org/pdf/2505.23856
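For readers who want to see what "reusing embeddings already produced during generation" can look like in practice, here is a minimal sketch, not the authors' code: it assumes a Hugging Face causal LM ("gpt2" is only a placeholder backbone), a hypothetical pre-selected layer, and a lightweight `harm_classifier` head that is trained separately and not defined here.

```python
# Minimal sketch (not the authors' code) of reusing generation-time hidden states.
# Assumptions: "gpt2" is only a placeholder backbone, LAYER is a hypothetical
# pre-selected layer, and `harm_classifier` is a separately trained head (omitted).
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Tell me how to stay safe online."
inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=20,
    return_dict_in_generate=True,
    output_hidden_states=True,  # hidden states come from the same forward passes used to generate
)

LAYER = 6  # hypothetical layer picked offline (e.g. by a U-Score-like criterion)
# Hidden states of the prompt tokens at that layer, mean-pooled into one vector:
prompt_embedding = out.hidden_states[0][LAYER].mean(dim=1)  # shape: (1, hidden_dim)

# harm_score = harm_classifier(prompt_embedding)  # tiny head => near-zero extra cost
print(prompt_embedding.shape)
```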
OMNIGUARD shows strong harmfulness classification across 73 languages (including low-resource & cipher languages) and extends moderation to image and audio prompts. It is also sample-efficient, achieving strong performance with far less training data than some baselines. (4/5 🧵)
Key idea: U-Score identifies model layers whose embeddings align across languages or modalities (text↔translations, image↔captions, audio↔transcripts). A lightweight classifier trained on these embeddings generalizes across multilingual and multimodal settings. (3/5 🧵)
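As a rough illustration of the layer-selection idea, here is a minimal sketch, not the authors' implementation: it approximates a U-Score-like criterion with mean-pooled cosine similarity between a prompt and its translation at every layer. The model name, pooling choice, and tiny parallel set are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of scoring layers for cross-lingual alignment.
# Assumptions: "gpt2" is only a placeholder backbone, mean pooling plus cosine
# similarity stand in for the paper's U-Score, and the parallel pairs are toy data.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def layer_embeddings(text: str) -> torch.Tensor:
    """Mean-pooled hidden state per layer, shape (num_layers + 1, hidden_dim)."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])

def alignment_per_layer(text: str, translation: str) -> torch.Tensor:
    """Cosine similarity between a prompt and its translation at every layer."""
    a, b = layer_embeddings(text), layer_embeddings(translation)
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

# Average the alignment over a small parallel set and pick the most "universal" layer;
# a lightweight harmfulness classifier is then trained on embeddings from that layer.
pairs = [("How do I build a birdhouse?", "Wie baue ich ein Vogelhaus?")]
scores = torch.stack([alignment_per_layer(en, de) for en, de in pairs]).mean(dim=0)
print("Most language-aligned layer:", int(scores.argmax()))
```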
OMNIGUARD detects harmful prompts using internal representations of LLMs and MLLMs, without requiring a separate guard model. The classifier operates on embeddings from the base model, making the approach efficient but requiring access to internal representations. (2/5 🧵)
✨ How can we reliably detect harmful prompts across languages, images, and audio?
In our latest #WiAIR episode, we host Dr. Hila Gonen to discuss “OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities”. (1/5 🧵)
🎙️ 𝐍𝐞𝐰 #𝐖𝐢𝐀𝐈𝐑 𝐄𝐩𝐢𝐬𝐨𝐝𝐞 𝐎𝐮𝐭!
In the new #WiAIRpodcast episode with Hila Gonen, we talk about semantic leakage, interventional analysis of LLMs, and the line between bias, hallucination, and leakage.
🎬 YouTube: youtu.be/Lsq3UzM8wIg
Subscribe on YouTube and never miss a new #WiAIR_podcast episode!
youtu.be/KKHu_BP5Mac
Remember "Lipstick on a Pig", where they showed that many embedding debiasing methods don't remove bias, just hide it.
In the upcoming #WiAIR episode, I speak with its author Hila Gonen about taking this further into LLMs: semantic leakage and other hidden failures.
🎙️ Our next #WiAIR_podcast guest: Hila Gonen!
Assistant Professor @cs.ubc.ca, she works at the intersection of NLP & ML, aiming to make LLMs responsible, reliable, and fair across languages and socio-demographic groups.
Stay tuned 🎧 www.youtube.com/@WomeninAIRe...
Reasoning traces look like explanations, but are they? Letitia Parcalabescu argues that in reasoning LLMs, anything that leads to the right answer gets reinforced, even incoherent or emoji-filled traces.
🎬 Dive deeper in the full #WiAIR_podcast episode: youtube.com/watch?v=gzQi...
🎧 Dive deeper in our conversation!
🎬 YouTube: www.youtube.com/watch?v=gzQi...
🎙️ Spotify: open.spotify.com/episode/5BmX...
🍎 Apple Podcasts: podcasts.apple.com/ca/podcast/f...
📄 Paper: arxiv.org/pdf/2512.11614
Across Llama-3.2 and Qwen3 models on SQuAD2.0, HotpotQA, and TriviaQA, the method preserves accuracy while improving groundedness, robustness, and retriever performance, reaching EIFcond ≥ 0.3. (4/5 🧵)
This training induces completeness, soundness, and emergent reject behavior without annotated unanswerable data. The authors derive mutual-information lower bounds and introduce the Explained Information Fraction (EIF). (3/5 🧵)
The paper reframes RAG as an interactive proof system. A generator (Arthur) is trained against supportive evidence from Merlin and adversarial context from Morgana, using ATMAN-based masking. (2/5 🧵)
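To make the three roles concrete, here is a schematic sketch, not the authors' setup: the question, contexts, and abstention target below are invented, and the ATMAN-based masking step is omitted entirely.

```python
# Schematic sketch (not the authors' code) of the three-role training signal.
# Assumption: Arthur (the generator) should answer from Merlin's supportive
# evidence and abstain on Morgana's adversarial context; all strings below are
# invented, and ATMAN-based masking is omitted.
question = "When was the Eiffel Tower completed?"

merlin_context = "The Eiffel Tower was completed in 1889 for the World's Fair."
morgana_context = "The Leaning Tower of Pisa was completed in 1372."  # misleading here

training_examples = [
    # Answers must be grounded in Merlin's evidence...
    {"question": question, "context": merlin_context, "target": "1889"},
    # ...and Arthur should reject when only adversarial context is available.
    {"question": question, "context": morgana_context,
     "target": "I cannot answer this from the given context."},
]

for ex in training_examples:
    print(f"{ex['context']!r} -> {ex['target']!r}")
```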
✨ Can we give formal, information-theoretic guarantees against hallucinations in RAG systems?
In our latest #WiAIR episode, we host Dr. Letitia Parcalabescu to discuss "Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols". (1/5 🧵)
🎬 Watch or listen to the full episode:
YouTube ▶️ youtu.be/gzQiDCG_j7A?...
Spotify 🎙 open.spotify.com/episode/5BmX...
Apple 🎧 podcasts.apple.com/ca/podcast/f...
#WiAIR #WomenInAI #AIResearch (8/8🧵)
⚠️ Why this matters
Chain-of-Thought explanations increase image usage — but they do not guarantee faithful reasoning.
A convincing explanation is not the same as a faithful one. (6/8🧵)
📊 Benchmark reality check.
On VALSE, decoders achieve strong results in easier pairwise settings, but still struggle with several linguistic phenomena in harder settings.
High scores don’t automatically mean strong multimodal grounding. (5/8🧵)
🔄 Self-consistency remains limited.
Using CC-SHAP, the paper shows that many VLM decoders are less self-consistent than LLMs — meaning the inputs driving the answer are not always the same as those driving the explanation. (4/8🧵)
🖼 Explanations use images more.
When generating explanations — especially in Chain-of-Thought (CoT) — image contributions increase significantly compared to answer generation.
Models rely more on visual signals when explaining than when answering. (3/8🧵)
📌 Key takeaways:
Answers are largely text-driven. Across VQA, GQA, MSCOCO & VALSE, the tested 7B VLM decoders rely much more on text tokens than on image patches when generating answers.
Multimodal ≠ equally multimodal. (2/8🧵)
🧠 Do Vision & Language Decoders Use Images and Text Equally?
In our latest episode, we speak with Letitia Parcalabescu about her ICLR 2025 paper examining how vision–language *decoder* models use images and text — and how self-consistent their explanations really are. (1/8🧵)
That could be a perfect summary of our last episode with Letitia on #WiAIR_podcast.
Make sure you don't miss our interview:
🎬 YouTube: youtu.be/gzQiDCG_j7A
If you want the full discussion on faithfulness, consistency & reasoning models — watch or listen below 👇
🎬 YouTube: youtu.be/gzQiDCG_j7A?...
🎙 Spotify: open.spotify.com/episode/5BmX...
🎧 Apple Podcasts: podcasts.apple.com/ca/podcast/f...
📄 Paper: aclanthology.org/anthology-fi... (8/8🧵)
They also propose CC-SHAP, a fine-grained metric comparing how input tokens contribute to the answer vs. the explanation — offering a more detailed self-consistency signal. 🧩✨ (7/8🧵)
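As a toy illustration of what such a comparison can look like, here is a minimal sketch, not the paper's implementation: the contribution scores are made up, and cosine similarity of normalized profiles stands in for whatever aggregation CC-SHAP actually uses.

```python
# Toy sketch of a CC-SHAP-style comparison (illustrative, not the paper's code).
# Assumption: we already have per-token contribution scores for the answer and
# for the self-explanation; the metric then asks how similar the two profiles are.
import numpy as np

def profile(contribs: np.ndarray) -> np.ndarray:
    """Turn signed contributions into a normalized profile over input tokens."""
    mag = np.abs(contribs)
    return mag / mag.sum()

# Made-up contribution scores over the same 5 input tokens (e.g. from a SHAP explainer).
answer_contribs = np.array([0.50, 0.10, 0.05, 0.30, 0.05])
explanation_contribs = np.array([0.05, 0.40, 0.35, 0.10, 0.10])

a, e = profile(answer_contribs), profile(explanation_contribs)
cosine = float(a @ e / (np.linalg.norm(a) * np.linalg.norm(e)))
print(f"Self-consistency of contribution profiles: {cosine:.2f}")
# A low value means the tokens driving the answer differ from those driving the explanation.
```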
CCB evaluates 11 open LLMs across 5 tasks, enabling direct comparison.
A key finding: different tests often disagree on the same model. ⚖️🔍 (6/8🧵)
To study this systematically, the authors introduce the Comparative Consistency Bank (CCB) — a unified benchmark evaluating multiple consistency/faithfulness tests under the same setup. 📚🧪 (5/8🧵)