Mike Zhang

@mjjzha

Postdoc - University of Copenhagen #NLPxEducation #NLPxHR #NLP Past: 🇩🇰 Aalborg University 🇩🇰 IT University of Copenhagen 🇨🇭 EPFL 🇸🇬 National University of Singapore 🇩🇪 NEC 🇳🇱 University of Groningen 🌐 https://jjzha.github.io/

296
Followers
423
Following
22
Posts
06.11.2024
Joined

Latest posts by Mike Zhang @mjjzha

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

13.02.2026 19:27 👍 22 🔁 12 💬 1 📌 0

Are you attending NAACL 2025 and are you interested in low-resource languages and dialects?

Then don't miss our very own @verenablaschke.bsky.social's keynote talk at the WNUT 2025 workshop on May 3rd:

Beyond “noisy” text: How (and why) to process dialect data

🌐 ☀️
noisy-text.github.io/2025/

15.04.2025 21:49 👍 17 🔁 5 💬 1 📌 0

🚀 We are excited to introduce Kaleidoscope, the largest culturally-authentic exam benchmark.

📌 Most VLM benchmarks are English-centric or rely on translations, missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 VLM evaluation

10.04.2025 20:24 👍 18 🔁 7 💬 1 📌 2

NoDaLiDa x Baltic-HLT 2025 is a wrap!

Thank you all for joining for a fruitful conference! Safe trip home and see you in Copenhagen or Vilnius in 2027!!

#nlp #nodalida #baltichlt

05.03.2025 15:11 👍 5 🔁 2 💬 0 📌 0

NoDaLiDa 2027 will be held at the Center of Language Technology at the University of Copenhagen!!

#nodalida #nlp

04.03.2025 15:23 👍 13 🔁 3 💬 0 📌 1

Welcome to NoDaLiDa / Baltic-HLT 2025!

After the opening speech (9:00), we're kicking off with the opening keynote by

Prof. @arianna-bis.bsky.social (09:20-10:10): "Not all Language Models need to be Large: Studying Language Evolution and Acquisition with Modern Neural Networks". (in Lääne-Euroopa)

03.03.2025 06:21 👍 7 🔁 1 💬 0 📌 0

The first day of workshops is almost a wrap! Join us later today at the welcome reception at the Institute of the Estonian Language (maps.app.goo.gl/brzig4jP6ZfZ...) from 18:30 onwards!!

#nlp #nodalida #baltichlt

02.03.2025 15:15 👍 1 🔁 1 💬 0 📌 0

Now conference-approved!

#nlp #nlproc #nodalida #baltichlt #tallinn

02.03.2025 14:03 👍 3 🔁 1 💬 1 📌 0

Morning!!! We're excited to welcome you to Tallinn!

On March 2nd (Sunday), we're starting with workshops in the Hestia Hotel Europa:

RESOURCEFUL 2025 (9:00-17:00): shorturl.at/HypPv

NB-REAL 2025 (9:00-13:00): nbreal.xyz

NLP4Ecology 2025 (13:30-17:30): nlp4ecology2025.di.unito.it

#nlp #nlproc

02.03.2025 06:23 👍 4 🔁 2 💬 0 📌 0

Heading to Tallinn for @nodalida.bsky.social! 🇪🇪 We’re presenting our work on:

🇩🇰 Sun, Mar 2: "DaKultur: Evaluating the Cultural Awareness of LLMs for Danish" (10:30)
🤖 Tue, Mar 4: "SnakModel: Lessons from Training Our Open Danish LLM" (10:45)

Finally, networking over some lovely Estonian soup! 🍲

01.03.2025 21:57 👍 16 🔁 2 💬 1 📌 1
NoDaLiDa/Baltic-HLT 2025 - Proceedings The proceedings of NoDaLiDa/Baltic-HLT 2025, the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies, are published by the University of...

The NoDaLiDa x Baltic-HLT Proceedings are up!

See here: www.nodalida-bhlt2025.eu/proceedings

See you also soon in Tallinn!

#NLP #NLProc #nodalida #baltichlt

28.02.2025 10:51 👍 5 🔁 4 💬 0 📌 0
LxMLS 2025 - The 15th Lisbon Machine Learning Summer School

The Lisbon Machine Learning Summer School (LxMLS) 2025 is now open for applications.

I’ve done the Covid version myself and found the content very useful. I visited Lisbon on another occasion, and the city is also a huge recommendation!!

lxmls.it.pt/2025/

27.02.2025 16:10 👍 2 🔁 0 💬 0 📌 0

The technical report is now out, loads of interesting insights into multilingual LLM pre-training ✍️

arxiv.org/abs/2502.12982

Congrats Longxu, Qian, Fan, Changyu and team!

#nlp

27.02.2025 16:04 👍 2 🔁 0 💬 0 📌 0

This work has now been accepted at #CVPR2025 🤘

27.02.2025 16:01 👍 2 🔁 0 💬 0 📌 0
NoDaLiDa/Baltic-HLT 2025 - Program All times are local (GMT+2/UTC+2). See detailed program below.

🚀 Thank you all for waiting! The full program of NoDaLiDa x Baltic-HLT is online:

www.nodalida-bhlt2025.eu/program

#nodalida #baltichlt #nlp #nlproc

18.02.2025 15:26 👍 2 🔁 2 💬 0 📌 0

NoDaLiDa/Baltic-HLT is in less than two weeks!

Some of the places you can visit:

Kalamaja, one of Tallinn's oldest districts, is known for its wooden houses and hipster vibe, with Telliskivi Creative City as its cultural hub. Nearby, Noblessner offers waterfront views and diverse dining options.

17.02.2025 13:27 👍 4 🔁 2 💬 1 📌 0

Looking for a PhD student to come work with me on the ethical implications of NLP from September!

Please share widely and point any interested students my way! 😊

11.02.2025 14:57 👍 26 🔁 15 💬 0 📌 0
POSTDOC IN NATURAL LANGUAGE PROCESSING SECURITY (NLPSec) (2025-224-06224) The postdoc will be working with the Natural Language Processing (NLP) team in the Data, Knowledge, and Web Engineering (...

The NLP group at Aalborg University (*Copenhagen campus*) is hiring a postdoc in NLP Security:

(ddl: 15 April)

www.vacancies.aau.dk/scientific-p...

#nlproc #nlp #aau

10.02.2025 08:50 👍 3 🔁 1 💬 0 📌 0

Otherwise there is also the Pierre Chocolaterie in the hidden Masters’ Courtyard, or a number of other trendy establishments. When it comes to views, you can’t beat those from the Old Town Wall, its towers and Toompea Hill’s viewing platforms!!

See you soon!

#nodalida #baltichlt #nlp #nlproc

10.02.2025 08:42 👍 1 🔁 1 💬 0 📌 0

NoDaLiDa/Baltic-HLT is less than a month away!

Did you know Tallinn is a living UNESCO treasure and also has a café culture? One such example is Maiasmokk, the oldest café in Tallinn, dating back to 1864!

10.02.2025 08:42 👍 4 🔁 2 💬 1 📌 1

Appreciate it! Attached a new post with working links

22.01.2025 12:57 👍 1 🔁 0 💬 0 📌 0

Now with working links:

Doc: docs.google.com/document/d/1...
Spreadsheet: docs.google.com/spreadsheets...

22.01.2025 12:56 👍 0 🔁 0 💬 0 📌 0

(DL end of Jan 2025)

22.01.2025 12:33 👍 0 🔁 0 💬 0 📌 0

We're missing coverage in Europe, especially from high-resource languages like French, German, and Italian. If you're interested, check our guidelines (docs.google.com/document/d/1...) and suggestions Google Sheet (docs.google.com/spreadsheets...).

Join us on the Aya Discord: discord.gg/4B5tEdbP

#NLP

22.01.2025 12:33 👍 0 🔁 0 💬 3 📌 0

Hi folks, in collaboration with @cohereforai.bsky.social, we're looking for contributors to a Multilingual **Multimodal** Exams benchmark in MCQ style. What's in it for you:

Submit 1000 Qs for high/mid-resource langs, or 500 for low-resource langs, to be eligible for co-authorship.

22.01.2025 12:33 👍 3 🔁 0 💬 1 📌 0
The Learning Dynamics of a PhD This is what a PhD looks like: 9.6 million seconds of research.

9.6 million seconds = 1 PhD 🔥

Finally analyzed my PhD time tracking data so you can plan your own research journey more effectively: mxij.me/x/phd-learning-dynamics

For current students: I hope this helps put your journey into perspective. Wishing you all the best!

23.12.2024 22:08 👍 36 🔁 7 💬 0 📌 1
GitHub - nlpnorth/snakmodel: An LLM continually pre-trained specifically for Danish.

The paper is going to be presented @nodalida.bsky.social:
📄: arxiv.org/abs/2412.12956
💻: github.com/nlpnorth/sna...
🤗: huggingface.co/NLPnorth/sna...

20.12.2024 21:07 👍 4 🔁 0 💬 0 📌 0

Nice to see @mjjzha.bsky.social presenting our joint collaboration with Aalborg University: SnakModel, a new language model for Danish 🇩🇰
(w/ @mxijme.bsky.social @elisabassignana.bsky.social and Rob van der Goot)

19.12.2024 11:01 👍 6 🔁 2 💬 0 📌 0

Thanks for the warm invite @dnnslmr.bsky.social !!!

19.12.2024 12:33 👍 8 🔁 0 💬 1 📌 0