
Niyati Bafna

@niyatibafna

PhD student @jhuclsp. Previously @AIatMeta, @InriaParisNLP, @EM_LCT| #NLProc

81
Followers
161
Following
53
Posts
14.02.2025
Joined

Latest posts by Niyati Bafna @niyatibafna


Multimodal LLMs can read text in images, but why do they often perform worse than when the same text is given as tokens? Our work studies the modality gap of models perceiving text as pixels and shows how to close it.
📄 arxiv.org/abs/2603.09095
🧵👇 #NLProc #LLM #ComputerVision

12.03.2026 13:32 👍 3 🔁 3 💬 1 📌 0

Frustrated with how most of the world’s low-resource languages have NO evaluation resources?

📢 Check out ChiKhaPo, a massively multilingual lexical comprehension and generation benchmark covering 2700+ languages.
www.arxiv.org/abs/2510.16928

24.11.2025 23:41 👍 1 🔁 2 💬 1 📌 0

Accepted at ACL main! Come chat about dialectal MT at our poster today at 4 pm.
Also, check out this largely bug-free package for generating your own synthetic dialectal data:
pypi.org/project/dial...

29.07.2025 12:14 👍 0 🔁 0 💬 0 📌 0

You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly! 🙅

We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how. 🕵️
(random is still a devilishly good baseline)
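As a toy illustration of a non-random picker (the paper's selection strategies are more principled; the greedy farthest-point heuristic over input embeddings below is just one invented example):

```python
import numpy as np

def farthest_point_sample(X, k, seed=0):
    """Pick k diverse rows of X by greedy farthest-point traversal.
    One simple alternative to uniform random sampling."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]            # random starting point
    dists = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))                 # farthest from chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

# Two well-separated clusters: the two picks land in different clusters,
# whereas random sampling might waste both on near-duplicates.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
print(farthest_point_sample(X, 2))
```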

15.07.2025 13:03 👍 33 🔁 3 💬 2 📌 0

Thanks! Yeah, that idea's definitely still around :) Although "language-agnostic" to a large extent seems to be "English" (arxiv.org/pdf/2402.18815, aclanthology.org/2024.acl-lon...)

10.07.2025 02:51 👍 1 🔁 0 💬 0 📌 0

This work was done with my amazing collaborators Tianjian Li, @kentonmurray.bsky.social, @davidrmortensen.bsky.social, David Yarowsky, Hale Sirin, and @danielkhashabi.bsky.social, @jhuclsp.bsky.social.

04.07.2025 17:04 👍 1 🔁 0 💬 0 📌 0

If you can’t decide whether to go end-to-end or MT cascade for your next multilingual experiments, or you want to build alternative architectures or adapters for multilingual LLMs, or you’ve wondered why (God, why) LLMs can’t solve tasks in other languages, this paper is for you.

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

Main takeaway: Translation failure is an important failure mode! Your model may be having wise and intelligent thoughts all up to its last couple of layers, and then failing to communicate them in Telugu because (like me) it has tried but failed to learn Telugu.

04.07.2025 17:04 👍 4 🔁 1 💬 2 📌 0

We break down the patterns in the above figure by source and target language, talk about what makes the neat pipeline picture a little more complicated, and show briefly what happens with a bigger model (spoiler: things improve but not too much). See paper for details!

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

In general, intermediate accuracy stays high even for LRL targets, but final accuracy quickly drops. And so TLP is high for most target languages (>50%). Except for low-resource *source* languages, in which case task-solving fails before we get to translation.

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

We then quantify *translation loss proportion*: the proportion of failure cases that had successful task-solving but failed translation (see paper for less hand-waviness). We look at intermediate task-solving accuracy (over all layers), final accuracy, and TLP.
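In sketch form (a hypothetical reconstruction from the description above, not the paper's code), TLP is the fraction of final failures that nevertheless had a correct intermediate decode:

```python
def translation_loss_proportion(intermediate_correct, final_correct):
    """TLP: among inputs whose final output is wrong, the fraction where
    some intermediate layer already decoded to the correct semantics
    (i.e., task-solving succeeded but translation failed)."""
    failures = [i for i, ok in enumerate(final_correct) if not ok]
    if not failures:
        return 0.0
    lost = sum(1 for i in failures if intermediate_correct[i])
    return lost / len(failures)

# 3 final failures, 2 of which had solved the task internally -> TLP = 2/3
print(translation_loss_proportion([True, True, False, True],
                                  [True, False, False, False]))
```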

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

What languages does task-solving occur in? We look at the distribution over languages of correct intermediate outputs and see that 1) English dominates, but 2) other supported HRLs have a considerable combined presence! Also, this mix looks largely the same regardless of target language.

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

We visualize the task-solving→translation pipeline, showing that intermediate layers have high *off-target* accuracy (task-solving), which gets converted (via translation) to *on-target* accuracy near the final layers. For HRLs. For LRL target languages, translation fails, resulting in bad outputs.

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

We look at a word translation task for 108 language pairs, and use logit lens to trace *task-solving accuracy* (correct semantics regardless of language) and *on-targetness* (correct target language) over model layers.
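The logit-lens step can be sketched like this (a minimal toy; in practice one applies the model's final layer norm and its real unembedding matrix to each layer's hidden state):

```python
import numpy as np

def logit_lens(hidden_state, W_U):
    """Decode an intermediate hidden state directly through the
    unembedding matrix W_U, skipping the remaining layers."""
    logits = hidden_state @ W_U
    return int(np.argmax(logits))

# Toy example: with an identity unembedding over a 3-token vocab,
# the strongest hidden dimension wins.
W_U = np.eye(3)
print(logit_lens(np.array([0.1, 2.0, -1.0]), W_U))  # token 1
```

Running this per layer gives a layer-by-layer trajectory of what the model "would say" early, which is what lets us separate correct semantics from correct target language.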

04.07.2025 17:04 👍 1 🔁 0 💬 1 📌 0

This hypothesis says that 1) Multilingual generation uses a model-internal task-solving→translation cascade. 2) Failure of the translation stage *despite task-solving success* is a large part of the problem. That is, the model often solves the task but fails to articulate the answer.

04.07.2025 17:04 👍 2 🔁 1 💬 1 📌 0

🔈 When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724

04.07.2025 17:04 👍 26 🔁 7 💬 2 📌 1

This work was done in (a super fun) collaboration with Matthew Wiesner, at the HLTCOE and @jhuclsp.bsky.social.

07.06.2025 17:27 👍 1 🔁 0 💬 0 📌 0

Apparently the ECAPA-TDNN model thinks I'm speaking Bengali when I read out Wordsworth to it. I wish I spoke Bengali. I wish Wordsworth spoke Bengali. But the cold harsh truth: SOTA LID should be better.

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

This module by itself shows very little accent-language confusion. In combination with the ECAPA-TDNN model, it shows large improvements on LID for L2-accented speech in English, French, and German, and minimal degradation on mainstream accented speech.

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

Okay, so how do we fix this problem? We investigate using a module that incorporates long-range information to help out. We look at two representations of the input: as a sequence of phones and a sequence of discretised SSL representations. And we put a classifier on top.
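As a toy illustration of why unit sequences carry longer-range cues than frame-level features (the actual module is a trained classifier; the unit sequences and nearest-profile scoring below are invented):

```python
from collections import Counter

def bigram_counts(units):
    """Bag of unit bigrams over a phone / discretised-SSL sequence:
    a longer-range cue than any single frame."""
    return Counter(zip(units, units[1:]))

def overlap(a, b):
    """Shared bigram mass between two unit-bigram profiles."""
    return sum(min(a[k], b[k]) for k in a)

# Invented 'language profiles' and a query utterance.
lang_a = bigram_counts([1, 2, 3, 1, 2, 3, 1, 2])
lang_b = bigram_counts([4, 5, 4, 5, 4, 5, 4, 5])
query = bigram_counts([1, 2, 3, 1, 2])
print("a" if overlap(query, lang_a) > overlap(query, lang_b) else "b")
# prints "a"
```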

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

This suggests that language identification models behave like accent identification models under the hood, largely relying on short-range phonotactics. When the accent-language association is broken, e.g. for L2-accented speech, LID models break. Badly!

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

Models that show less block permutation invariance, such as the GEO model (aclanthology.org/2024.naacl-l...), also appear more robust to L2 accents.

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

To test this, we look at *block permutation invariance*, i.e., the length of ordered (as well as unordered) input features that SOTA models rely on; our experiments indicate that they use features describing only about 1-2 phones.
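The probe can be sketched as follows (hypothetical; the paper's exact features and block definitions differ): shuffle fixed-length blocks of frames and see how far LID accuracy drops as block length shrinks.

```python
import numpy as np

def permute_blocks(features, block_len, seed=0):
    """Shuffle fixed-length blocks of a feature sequence, keeping
    within-block order. If a model's predictions survive shuffling
    at small block_len, it only uses short-range (phonotactic) cues."""
    rng = np.random.default_rng(seed)
    n = len(features) // block_len * block_len
    blocks = [features[i:i + block_len] for i in range(0, n, block_len)]
    rng.shuffle(blocks)                      # reorder blocks, not frames
    return np.concatenate(blocks + [features[n:]])  # keep any tail

frames = np.arange(10)
print(permute_blocks(frames, 2))  # same frames, blocks of 2 reordered
```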

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

Our hypothesis: this is caused by the model’s using too-short features. The intuition is that accents are characterised by short phone-usage-type features, languages by vocabulary and syntax. L2 accented speech imposes the former over the latter, causing confusion when models are short-sighted.

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

Accent-language confusion: The mis-recognition of L2-accented speech as the L1 substrate or a related language. For example, when Indonesian-accented English is classified as Indonesian, Malay, etc. A large part of model error on L2-accented speech follows this pattern!

07.06.2025 17:27 👍 1 🔁 0 💬 1 📌 0

We know that speech LID systems flunk on accented speech. But why? And what can we do about it? 🤔
Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of feature that the model relies on, and proposes a fix.

07.06.2025 17:27 👍 6 🔁 3 💬 1 📌 0

Presented DialUp (MT, dialect continua, robustness, etc.; arxiv.org/abs/2501.16581) to some new people this week! Thanks Hale and @schmidtsciences.bsky.social for inviting me up to New York 🥯

Saw some magnolias too :)

11.04.2025 00:50 👍 2 🔁 0 💬 0 📌 0

This work was done with my amazing collaborators: Emily Chang, Nathaniel Robinson, @davidrmortensen.bsky.social, @kentonmurray.bsky.social, David Yarowsky, and Hale Sirin, @jhuclsp.bsky.social.

27.02.2025 02:44 👍 4 🔁 0 💬 0 📌 0

We hope you use DialUp!
Our code for 1) artificial dialect generation for supported languages, 2) M→D fine-tuning, and 3) D→M inference, as well as the lexicons we collated for all our languages (including function word lexicons!), are here:
github.com/niyatibafna/...

27.02.2025 02:44 👍 3 🔁 0 💬 1 📌 0

DialUp takes a step towards making MT models robust to dialectal variation in a principled and systematic way. M→D is cheap to train and offers robustness to *general* dialectal variation.
And D→M is training-free and only requires tiny and easily collectable function word lexicons!
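The D→M idea can be sketched in a few lines (a toy illustration; the 'dialect' and lexicon below are invented, and the real package handles far more than exact-match substitution):

```python
def dialect_to_mainstream(tokens, function_word_lexicon):
    """Map dialect function words to mainstream-variety equivalents,
    so an off-the-shelf MT model sees more familiar input."""
    return [function_word_lexicon.get(t, t) for t in tokens]

lex = {"dis": "this", "da": "the"}  # toy function-word lexicon
print(dialect_to_mainstream("dis is da plan".split(), lex))
# ['this', 'is', 'the', 'plan']
```

Content words pass through untouched, which is why a tiny lexicon of closed-class words goes a long way.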

27.02.2025 02:44 👍 2 🔁 0 💬 1 📌 0