Daan van Esch (@daanvanesch.nl)

💭We need more research that focuses on aspects beyond accuracy, especially in multilingual AI.

👉Help us explore the importance of culture in building and testing AI, with a few minutes of your time.

Happy to have a chat as well with anyone who's interested in that space!

03.03.2026 19:40 👍 5 🔁 3 💬 1 📌 0

Lovely memories of my visit a few years ago!

27.02.2026 18:05 👍 1 🔁 0 💬 0 📌 0

Okay AfroLID is working now! It's also a quantized model, but it's still 200MB so it's a bit hefty for a metered connection...maybe I'll put it behind a button you can click to load it explicitly. The FastText models go down to ~15MB each so that still feels acceptable without explicit UI action

22.02.2026 16:52 👍 1 🔁 0 💬 0 📌 0

...know if you have any suggestions / feature requests / etc, or if there are models that I should be adding. BTW, caveat in case it's not obvious yet from the page (I'll also make this clearer), the FastText models are quantized to fit in-browser so they won't deliver the exact same results

22.02.2026 16:31 👍 1 🔁 0 💬 1 📌 0

Thanks for spreading the word! Inspired by your great paper :) I still need to fix AfroLID inference and I also want to spend a bit of time improving the intro at the top of the page, adding references to all the LID models, and so on. So definitely work in progress, but don't hesitate to let me...

22.02.2026 16:31 👍 1 🔁 0 💬 1 📌 0

Screenshot of old vs. new OCR. The old OCR text is garbled; the new OCR is much cleaner.

Re-OCR'd the complete 1771 Encyclopaedia Britannica (2,724 pages) with a single command on @hf.co Jobs.

- 0.9B model (GLM-OCR)
- ~$0.002/page
- ~$5 total on an L4 GPU

Before (old Tesseract OCR) → After

19.02.2026 11:29 👍 96 🔁 16 💬 5 📌 6
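The per-page and total figures quoted in the post above can be cross-checked with a line of arithmetic. This is only a sanity check on the numbers as posted (2,724 pages, roughly $0.002 per page); it assumes nothing about how the job was actually billed.

```python
# Figures copied from the post; both costs are approximate ("~") in the original.
pages = 2724
cost_per_page_usd = 0.002

# Total cost implied by the per-page figure.
total = pages * cost_per_page_usd
print(f"Implied total: ${total:.2f}")
```

The implied total of about $5.45 is in line with the "~$5 total" stated in the post.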

🌱Very proud of our team's latest release 😊 meet Tiny Aya, a massively multilingual model with 3.35B parameters.

Tech report here: github.com/Cohere-Labs/...

18.02.2026 02:16 👍 33 🔁 7 💬 1 📌 0

It's just a static HTML file with all the CSS and JS embedded, and it runs in-browser, so it's easy to move around, and it should be reasonably straightforward to have multiple copies pointing to multiple preconfigured EAF files (plus audio) so you can like, potentially show a corpus or something.

17.02.2026 20:52 👍 1 🔁 0 💬 1 📌 0

EAF Viewer

daanvanesch.nl/eaf-viewer.h... is the one I was playing with the other day, it currently scrolls horizontally but I'm sure Gemini/Codex/Claude would quickly make that vertical instead. Haven't had the chance to test it on a lot of EAF files yet so it may not work out-of-the-box but happy to help!

17.02.2026 20:52 👍 1 🔁 0 💬 1 📌 0

LingView

Cool yeah that should work, let me dig it up! BTW there's also brownclps.github.io/LingView/#/s... which is github.com/BrownCLPS/Li... -- definitely also an option especially if you're comfortable poking around in the terminal, see github.com/BrownCLPS/Li...

17.02.2026 19:51 👍 1 🔁 0 💬 1 📌 0

Two great groups teaming up, looking forward to seeing the impact you'll be delivering together!

17.02.2026 19:32 👍 2 🔁 0 💬 0 📌 0

I (had my coding agent) put something together the other day that did this for some ELAN eaf files, is that what you're using as well? I can dig it up, probably quite doable for like, a Praat TextGrid file too

17.02.2026 19:31 👍 1 🔁 0 💬 1 📌 0

I'm not sure about WordPress plugins etc but for a format like ELAN eaf files it's reasonably straightforward to have one of the modern coding agents whip something up if you're just working without a CMS like WordPress in the loop. Is it an existing page you'd want to add the time-aligned text to?

17.02.2026 18:38 👍 2 🔁 0 💬 1 📌 0
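For context on why this is reasonably straightforward: an ELAN .eaf file is XML, with time slots listed under TIME_ORDER and each ALIGNABLE_ANNOTATION pointing at a start and end slot. Below is a minimal sketch of pulling out the time-aligned text with Python's standard library. It is not the code behind the viewer mentioned in this thread; the function name and the tiny inline sample document are made up for illustration, and real files carry more structure (headers, linguistic types, reference annotations) that this ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; real .eaf files contain more elements.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcript">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>world</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(eaf_xml: str):
    """Return (tier_id, start_ms, end_ms, text) tuples sorted by start time."""
    root = ET.fromstring(eaf_xml)

    # Map each time slot ID to its value in milliseconds.
    # Slots without a TIME_VALUE (unaligned) are skipped.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        if value is not None:
            slots[slot.get("TIME_SLOT_ID")] = int(value)

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))

    # Annotations with an unknown start time sort last.
    rows.sort(key=lambda r: (r[1] is None, r[1] or 0))
    return rows


rows = read_aligned_annotations(SAMPLE_EAF)
for tier_id, start, end, text in rows:
    print(tier_id, start, end, text)
```

From a list like this, rendering time-aligned text next to an audio player is mostly a matter of highlighting the row whose start and end times bracket the current playback position.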

Great to see this amazing collaborative work on an absolutely key problem in building tech that works well across the world's languages: language classification in web text. Often ignored, it's still one of my personal favorite areas to work in. Congrats and thank you to everyone who worked on this!

13.02.2026 20:44 👍 2 🔁 1 💬 0 📌 0

Why care about LangID on crawled data? It's the first gate in the multilingual data pipeline. If your LID model misclassifies a low-resource language as noise or confuses it with a related high-resource one, that language doesn't make it into your corpus.

Bad LangID = no data.

13.02.2026 19:46 👍 6 🔁 1 💬 1 📌 0
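The "first gate" point can be made concrete with a toy sketch. Everything here is hypothetical: `lid_gate` and `toy_predict` are illustrative names, and the keyword-based classifier stands in for a real LID model only to show the two failure modes named in the post (a related language absorbed into a high-resource label, and a low-resource language discarded as noise).

```python
from collections import Counter
from typing import Callable, Iterable

Prediction = tuple[str, float]  # (language label, confidence)


def lid_gate(
    docs: Iterable[str],
    predict: Callable[[str], Prediction],
    min_confidence: float = 0.5,
):
    """Bucket documents by predicted language; drop low-confidence ones."""
    kept: dict[str, list[str]] = {}
    dropped: Counter = Counter()
    for doc in docs:
        label, confidence = predict(doc)
        if confidence < min_confidence:
            dropped["low_confidence"] += 1
            continue
        kept.setdefault(label, []).append(doc)
    return kept, dropped


def toy_predict(text: str) -> Prediction:
    """Hypothetical stand-in for an LID model that only knows English and Dutch."""
    words = set(text.lower().split())
    if words & {"the", "and", "of"}:
        return "en", 0.9
    if words & {"de", "het", "een"}:
        return "nl", 0.9
    return "und", 0.1  # undetermined: treated as noise by the gate


docs = [
    "The cat and the dog",     # English
    "De kat zit op het dak",   # Dutch
    "Ek het 'n boek gelees",   # Afrikaans: "het" triggers the Dutch rule
    "Ik haw in boek lêzen",    # intended as West Frisian (illustrative): no rule matches
]

kept, dropped = lid_gate(docs, toy_predict)
print({label: len(items) for label, items in kept.items()})
print(dict(dropped))
```

In this toy run the Afrikaans sentence ends up in the Dutch bucket and the West Frisian one is dropped, so neither language appears in the resulting corpus under its own label, which is the "bad LangID = no data" effect.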

Oh I see, a lot of them are controlled manually, got it! Still, always a nice city to visit so maybe I'll squeeze it in somewhere in the next few weeks. Thanks!

13.02.2026 18:49 👍 1 🔁 0 💬 0 📌 0

That's cool! I've always wanted to learn more about premodern automata like this, supposedly there were also some mechanical marvels in the Tang dynasty. Looks like I'll have to make my way over to Aachen sometime soon!

13.02.2026 18:09 👍 1 🔁 0 💬 1 📌 0

But when I was in Leuven last year and asked (in an otherwise Dutch sentence) for a pain au chocolat at the bakery, the lady did laugh and say that there's no need to pull out fancy French words. To me it's very standard in nl-nl to just use the French name. Now I know the proper Flemish Dutch name!

09.02.2026 08:22 👍 4 🔁 0 💬 2 📌 0

In my Netherlandic Dutch I always just call this a "pain au chocolat" and that's also what some brands of baked goods at AH call them, but the ones AH sells from their own bakery are "chocoladebroodjes". I'd never heard "chocoladekoek".

09.02.2026 08:22 👍 5 🔁 0 💬 2 📌 0

Dark blue background; to the right, Special Session logo of circular graphic with block colours and waveform through the middle, on white square. To the left, white text 'Indigenous Voices in Speech Sciences and Technology', with Interspeech 2026 logo above, and website interspeech2026.org below.

What are the roles of speech science and technology projects in advancing Indigenous cultural vitality and self-determination? Participate in this important discussion in the Special Session 'Indigenous Voices in Speech Sciences and Technology' at #Interspeech2026 indigenousvoicesinterspeech.github.io

05.02.2026 08:00 👍 2 🔁 4 💬 0 📌 0

The intelligence part suddenly feels quite a bit ahead of all the rest of it - integrations (tools, knowledge), the necessity for new organizational workflows, processes, diffusion more generally. 2026 is going to be a high energy year as the industry metabolizes the new capability."

26.01.2026 23:25 👍 4 🔁 1 💬 0 📌 0

TranslateGemma: A new suite of open translation models
TranslateGemma is a new family of open translation models built on Gemma 3.

Meet TranslateGemma. 💎
✅ Open weights (4B, 12B, 27B)
✅ 55 languages + 100s more in training data
✅ Multimodal capabilities (image text)
Blog: blog.google/innovation-a...
Paper: arxiv.org/pdf/2601.09012
Model: huggingface.co/collections/...
Cookbook: colab.research.google.com/github/googl...

16.01.2026 19:50 👍 38 🔁 4 💬 1 📌 2

...wherever we can (so you can see if we used e.g. Glottolog's lat-long or some other source) but agreed that even more detailed descriptions of how each speaker estimate and writing system was arrived at would've been great. Still we hoped that on the whole it might be useful, but totally agree!

11.01.2026 10:33 👍 1 🔁 0 💬 1 📌 0

Writing Across the World's Languages: Deep Internationalization for Gboard, the Google Keyboard
This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report descri...

it'd be something we'd release publicly. This was part of the work we did for doi.org/10.48550/arX... and then we eventually ended up releasing it publicly in 2022 and 2024 LREC papers. In the detailed LinguaMeta JSON files for each language we do provide a back reference...

11.01.2026 10:33 👍 2 🔁 0 💬 1 📌 0

...but it's quite challenging, also involving lots of non-academic sources, and sometimes it's also just a judgment call. I wish we'd documented it in a more structured way looking back! But when we were working on this like 10 years ago it didn't occur to me at the time that someday...

11.01.2026 10:33 👍 1 🔁 0 💬 1 📌 0

Yeah, totally agree, this is one of those things that retroactively I wish I'd done...the back story is that when we were working to figure out what languages to add to Gboard (Google's Android keyboard; hence the writing systems), we pored over countless resources to create best-effort estimates...

11.01.2026 10:33 👍 1 🔁 0 💬 1 📌 0

Our goal is definitely not to replace Glottolog or any of the existing amazing resources, we just thought it'd be good to have a public version of our internal visualization that we use to give a broader (non-linguistics) audience a sense of the world's rich linguistic diversity. Happy to chat more!

11.01.2026 10:03 👍 1 🔁 0 💬 1 📌 0

Let me check on this on Monday, would be great to make the table (and paper) Lorena linked to more easily accessible from the visualization site. We're huge Glottolog fans and the visualization was intended to live alongside the paper (presented at LREC 2024) but we'll make the link clearer!

11.01.2026 10:03 👍 1 🔁 0 💬 1 📌 0

I believe Arpitan has historically also been spoken in Aosta, so perhaps that has something to do with it? In roughly the same corner of the world, in Monaco, there are signs in French and Monégasque (a variety of Ligurian, formerly spoken as far as Nice):

09.12.2025 20:36 👍 3 🔁 0 💬 1 📌 0

In the area of the Blue Ridge Mountains where Cherokee is spoken:

09.12.2025 19:10 👍 1 🔁 0 💬 1 📌 0
