
Computational Linguistics @UPF

@colt-upf

Gemma Boleda, Marco Baroni, Thomas Brochhagen, Iria de Dios Flores | Computational Linguistics and Linguistic Theory, Universitat Pompeu Fabra, Barcelona. upf.edu/web/colt

713 Followers · 355 Following · 23 Posts · Joined 11.11.2024

Latest posts by Computational Linguistics @UPF @colt-upf

Sample ManyNames images with associated names, in English and Mandarin Chinese

Releasing v2.3 of ManyNames, an object naming dataset with 25K objects in real-world images (English, plus partial coverage in Catalan and Mandarin Chinese). Check it out!

amore-upf.github.io/manynames/

(New in this version: further data cleaning, speaker ID, more lexical info)

15.01.2026 14:19 👍 4 🔁 3 💬 0 📌 0
Post image

@ecesuurker.bsky.social presenting NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use (Zhang et al., EMNLP Findings, 2025)!

23.12.2025 11:04 👍 2 🔁 1 💬 0 📌 0
Post image

LLMs as a synthesis between symbolic and distributed approaches to language (ACL Findings, 2025), a talk by Gemma Boleda @gboleda.bsky.social

23.12.2025 11:04 👍 3 🔁 0 💬 1 📌 1
Post image

Our group presented our work at Deep Learning BCN! Some highlights below.

@dlbcnai.bsky.social

23.12.2025 11:04 👍 5 🔁 1 💬 1 📌 0

Amazing work by Jeanne Bruneau-Bongard, Emmanuel Chemla, and Thomas Brochhagen!

08.12.2025 09:30 👍 0 🔁 0 💬 0 📌 0
Assessing Pressures Shaping Natural Language Lexica: Human languages balance communicative informativity with complexity, conveying as much as needed through the simplest means required to do so. Yet, these concepts, informativity and complexity, have be...

Many forces have been argued to shape natural language lexica, and there are different ways they can be operationalized and interact. We study which out of a set of forces and their interactions best fit cross-linguistic data. Now out in Cognitive Science: onlinelibrary.wiley.com/doi/10.1111/...

08.12.2025 09:30 👍 3 🔁 0 💬 1 📌 0

Great work by Xixian Liao, Thomas Brochhagen, @gboleda.bsky.social, and @laiamayol.bsky.social!

08.10.2025 13:20 👍 3 🔁 0 💬 0 📌 0
Post image

Do you use a pronoun more often when the entity youโ€™re talking about is more predictable?

Previous work offers diverging answers, so we conducted a meta-analysis combining data from 20 studies across 8 different languages.

Now out in Language: muse.jhu.edu/article/969615
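For readers curious how combining studies works mechanically, here is a minimal inverse-variance (fixed-effect) pooling sketch, the generic textbook method; the effect sizes and variances below are made-up illustrations, and the paper's actual meta-analytic model is likely more elaborate.

```python
def pool(effects, variances):
    """Inverse-variance pooling: weight each study's effect size by
    1/variance, then take the weighted average. The pooled variance is
    the reciprocal of the total weight."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_var = 1.0 / sum(weights)
    return pooled, pooled_var

# Three hypothetical studies of a predictability effect on pronoun use:
# precise studies (small variance) pull the pooled estimate toward them.
effects = [0.30, 0.10, 0.20]
variances = [0.01, 0.04, 0.02]
est, var = pool(effects, variances)
print(round(est, 3), round(var, 4))
```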

08.10.2025 13:20 👍 3 🔁 1 💬 1 📌 1
Post image

📢 Research seminar organized by COLT-URLING: "LLM and human language: representations, judgments, and historical change".

📆 29/09/2025
🕦 15:30
🎤 Adele Goldberg (Princeton University)
🚩 55.410, Tànger Building, Poblenou Campus, UPF
ℹ️ ja.cat/wi2t7

@colt-upf.bsky.social

26.09.2025 08:12 👍 0 🔁 1 💬 0 📌 0
Post image

📢 Research seminar organized by COLT-URLING: "Associative memory in psycholinguistics and in AI architectures".

📆 01/10/2025
🕦 12:00
🎤 Jakub Dotlačil
🚩 55.410, Tànger Building, Poblenou Campus, UPF
ℹ️ ja.cat/U5xH2

@colt-upf.bsky.social

29.09.2025 10:45 👍 0 🔁 2 💬 0 📌 0
Sigmoid function. Non-linearities in a neural network allow it to behave in both distributed and near-symbolic fashions.

New paper! 🚨 I argue that LLMs represent a synthesis between distributed and symbolic approaches to language because, when exposed to language, they develop highly symbolic representations and processing mechanisms in addition to distributed ones.
arxiv.org/abs/2502.11856
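One quick way to see the intuition behind the figure: a sigmoid unit interpolates between graded and switch-like behavior depending on how far its input sits from zero. A minimal sketch (illustrative only, not from the paper):

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Near zero the sigmoid is almost linear: small input changes give
# graded, "distributed" output changes.
graded = [round(sigmoid(x), 3) for x in (-0.5, 0.0, 0.5)]

# Far from zero it saturates: outputs are pinned near 0 or 1, so the
# unit acts like a discrete, "near-symbolic" switch.
switch_like = [round(sigmoid(x), 3) for x in (-8.0, 8.0)]

print(graded)       # graded values clustered around 0.5
print(switch_like)  # values very close to 0 and 1
```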

30.09.2025 13:15 👍 27 🔁 11 💬 1 📌 0
Postdoc in Natural Language Processing

📢 I am hiring a Postdoc to work on post-training methods for low-resource languages. Apply by August 15: employment.ku.dk/faculty/?sho....
Let's talk at #ACL2025NLP in Vienna if you want to know more about the position and life in Denmark.

07.07.2025 12:47 👍 23 🔁 12 💬 0 📌 0
Screenshot of first page of paper. It is here: https://arxiv.org/pdf/2507.00828

Abstract: Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations


Evaluating topic models (and document-clustering methods) is hard. In fact, since our paper critiquing standard evaluation practices four years ago, there hasn't been a good replacement metric.

That ends today (we hope)! Our new ACL paper introduces an LLM-based evaluation protocol 🧵
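The protocol's shape can be sketched in two steps: an annotator (human or LLM proxy) sees the items grouped under a topic and infers a category label, then applies that label to held-out documents, and agreement with the true grouping is scored. The helper functions and toy data below are hypothetical stand-ins, not the paper's implementation (a frequency heuristic replaces the human/LLM judgment):

```python
from collections import Counter

def infer_category(topic_items):
    """Step 1 stand-in for a human or LLM judgment: label the topic
    with its most frequent word."""
    words = Counter(w for doc in topic_items for w in doc.split())
    return words.most_common(1)[0][0]

def apply_category(category, documents):
    """Step 2: decide, per held-out document, whether the label fits."""
    return [category in doc.split() for doc in documents]

topic_items = ["goal match referee", "goal save keeper"]
held_out = ["late goal wins it", "stock market report"]

label = infer_category(topic_items)
decisions = apply_category(label, held_out)
print(label, decisions)  # the label fits the first document only
```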

08.07.2025 12:40 👍 52 🔁 10 💬 3 📌 2

🎉 New paper "Prediction Hubs are Context-Informed Frequent Tokens in LLMs" from our lab, accepted at ACL 2025!

If you're interested in representational geometry, come find Beatrix Nielsen and Marco Baroni at the poster :)

08.07.2025 11:33 👍 1 🔁 0 💬 0 📌 0

Today at UPF Campus de la Ciutadella at 2:30 pm! Come a little early to check in!

Sala Polivalent 24S18

maps.app.goo.gl/n1hBxiviKcLW...

02.06.2025 08:38 👍 2 🔁 0 💬 0 📌 0

๐Ÿ“ข ๐—Ÿ๐—ผ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฐ๐—ต๐—ฎ๐—ป๐—ด๐—ฒ๐Ÿ“ข

UPF Campus de la Ciutadella
**Sala Polivalent 24.S18**

Thank you for bearing with us!

29.05.2025 09:46 👍 0 🔁 0 💬 0 📌 0

Last day to sign up for the COLT Symposium!
Register: tinyurl.com/colt-register

๐Ÿ“ข ๐—Ÿ๐—ผ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฐ๐—ต๐—ฎ๐—ป๐—ด๐—ฒ๐Ÿ“ข
June 2nd, 14:30 - 19:00

UPF Campus de la Ciutadella
Room 40.101

maps.app.goo.gl/1216LJRsWmTE...

26.05.2025 10:44 👍 5 🔁 1 💬 0 📌 1

โญ Registration open til May 27th! โญ
Website: www.upf.edu/web/colt/sym...

June 2nd, UPF

๐—ฆ๐—ฝ๐—ฒ๐—ฎ๐—ธ๐—ฒ๐—ฟ ๐—น๐—ถ๐—ป๐—ฒ๐˜‚๐—ฝ:
Arianna Bisazza (language acquisition with NNs)
Naomi Saphra (emergence in LLM training dynamics)
Jean-Rémi King (TBD)
Louise McNally (pitfalls of contextual/formal accounts of semantics)

20.05.2025 08:13 👍 4 🔁 1 💬 0 📌 2

Updated website: www.upf.edu/web/colt/sym...

14.05.2025 16:56 👍 0 🔁 0 💬 0 📌 0

**Getting there:**

**When:** 2nd June 2025, 14:30 - 19:00
**Where:** UPF Poblenou, Auditori (enter via the Roc Boronat building) maps.app.goo.gl/2WMt21hR5L9r...

In-person only, with mandatory registration:
tinyurl.com/colt-register

See you there!

🧵 3/3

13.05.2025 09:00 👍 0 🔁 0 💬 0 📌 0

Our speakers span a wide range of expertise between AI, linguistics, and neuroscience.

14:30 Arianna Bisazza (Uni. Groningen)
15:30 Naomi Saphra (Harvard)

-- coffee break --

17:00 Jean-Rémi King (Meta AI)
18:00 Louise McNally (UPF)

Abstracts: tinyurl.com/colt-site

🧵 2/3

13.05.2025 09:00 👍 1 🔁 0 💬 1 📌 0

Announcing the COLT Symposium on June 2nd!

๐—˜๐—บ๐—ฒ๐—ฟ๐—ด๐—ฒ๐—ป๐˜ ๐—ณ๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ๐˜€ ๐—ผ๐—ณ ๐—น๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ ๐—ถ๐—ป ๐—บ๐—ถ๐—ป๐—ฑ๐˜€ ๐—ฎ๐—ป๐—ฑ ๐—บ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ๐˜€

What properties of language are emerging from work in experimental and theoretical linguistics, neuroscience & LLM interpretability?

Info: tinyurl.com/colt-site
Register: tinyurl.com/colt-register

🧵 1/3

13.05.2025 09:00 👍 4 🔁 2 💬 1 📌 2

Please find us at #ICLR2025! We will present our work on intrinsic dimension as a cue for stages of language processing in LLMs.

Saturday morning, Poster session 5
Hall 3 + Hall2B #563
iclr.cc/virtual/2025...

arXiv: arxiv.org/abs/2405.15471

22.04.2025 14:37 👍 1 🔁 0 💬 0 📌 0
Post image

📢 Upcoming Seminar

Words are weird? On the role of lexical ambiguity in language
🗣 Gemma Boleda (Universitat Pompeu Fabra, Spain)
Why is language so ambiguous? Discover how ambiguity balances cognitive simplicity and communicative complexity through large-scale studies.
📍 UniMiB, Room U6-01C, Milan

03.03.2025 13:41 👍 13 🔁 6 💬 2 📌 0

⚡ New position paper from Gemma Boleda: is it time to make peace between symbolic and continuous approaches to language?

24.02.2025 17:13 👍 3 🔁 0 💬 0 📌 0
Prediction hubs are context-informed frequent tokens in LLMs: Hubness, the tendency for a few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data,...

The project I did with Marco Baroni and Iuri Macocco while I was in Barcelona is now on arXiv: arxiv.org/abs/2502.10201 🎉

TLDR below 👇
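For intuition, hubness is commonly measured with the k-occurrence count: how often each point shows up among the other points' k nearest neighbours. A toy 1-D sketch (illustrative only, not the paper's setup), where a centrally placed point becomes everyone's neighbour:

```python
def k_occurrence(points, k=2):
    """For each point, find its k nearest neighbours by absolute
    distance; return, per point, how often it was chosen. A hub is a
    point chosen far more often than average."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        neighbours = sorted(
            (abs(p - q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in neighbours[:k]:
            counts[j] += 1
    return counts

# The point at 1.0 sits at the centre of the cluster and is picked by
# every other point, so it dominates the k-occurrence distribution.
points = [0.0, 0.9, 1.0, 1.1, 2.0]
print(k_occurrence(points, k=2))  # [0, 3, 4, 3, 0]: index 2 is the hub
```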

24.02.2025 08:06 👍 3 🔁 2 💬 1 📌 0
CoNLL 2025

This year, CoNLL will be accepting *non-archival* (as well as archival) submissions! www.conll.org #CoNLL2025

Follow CoNLL at
@conll-conf.bsky.social

05.02.2025 14:15 👍 1 🔁 1 💬 0 📌 0
Post image

Here's our work accepted to #ICLR2025!

We look at how intrinsic dimension evolves over LLM layers, spotting a universal high-dimensional phase.

This ID peak is where:

- linguistic features are built
- different LLMs are most similar

with implications for task transfer.

🧵 1/6
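For readers who want the general idea, intrinsic dimension can be estimated from nearest-neighbour distance ratios alone, TwoNN-style: d ≈ N / Σ log(r2/r1), where r1 and r2 are each point's first- and second-nearest-neighbour distances. Below is a minimal sketch on 1-D toy data, a generic estimator rather than the paper's code:

```python
import math
import random

def two_nn_dimension(points):
    """TwoNN-style intrinsic-dimension estimate for 1-D coordinates:
    d is approximately N divided by the sum of log(r2/r1) over points,
    with r1 and r2 the two smallest distances to other points."""
    logs = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, which give r2/r1 = inf
            logs.append(math.log(r2 / r1))
    return len(logs) / sum(logs)

random.seed(0)
# Points sampled along a 1-D manifold: the estimate should be near 1.
pts = [random.random() for _ in range(300)]
print(round(two_nn_dimension(pts), 2))
```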

02.02.2025 18:46 👍 12 🔁 2 💬 1 📌 1
Què és l'aprenentatge profund? - La Dimoni de Maxwell #deeplearning #ciencia #català #barcelona (YouTube video by Deep Learning Barcelona)

What is deep learning? (Què és l'aprenentatge profund?)

@marionamec.bsky.social of @neurofregides.bsky.social explains it for the Deep Learning Barcelona Symposium 2024 (@dlbcn.ai), this Thursday, December 19.

#deeplearning #ciencia #català #barcelona

www.youtube.com/shorts/R4u_Z...

16.12.2024 08:49 👍 7 🔁 3 💬 0 📌 1
Post image

Conclusion: for communication in context, lexical systems with a soft mapping between referents and names let speakers maximize communication accuracy while minimizing complexity.

Paper: aclanthology.org/2024.emnlp-m...

3/3
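The idea can be sketched with a toy lexicon (the names and referents below are hypothetical illustrations, not the paper's model): an ambiguous name still communicates accurately because context prunes the competing referents, while reusing names across referents keeps the lexicon small.

```python
# A "soft" lexicon: each name maps to several possible referents.
lexicon = {
    "bat": {"animal_bat", "baseball_bat"},
    "mouse": {"animal_mouse", "computer_mouse"},
}

def interpret(name, context):
    """Listener strategy: intersect the name's possible referents with
    the referents plausible in context; succeed if exactly one remains."""
    candidates = lexicon[name] & context
    return next(iter(candidates)) if len(candidates) == 1 else None

# In a sports context, "bat" is unambiguous despite the soft mapping.
sports_context = {"baseball_bat", "computer_mouse"}
print(interpret("bat", sports_context))  # resolves to "baseball_bat"

# Complexity saving: 2 names cover 4 referents.
n_names = len(lexicon)
n_referents = len(set().union(*lexicon.values()))
print(n_names, n_referents)
```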

02.12.2024 10:38 👍 7 🔁 0 💬 0 📌 0