
Craig Schmidt

@craigschmidt.com

Interested in ML, AI, and NLP. Particularly interested in tokenization. Live in the Boston area and work in R&D at Kensho Technologies.

538 Followers · 2,322 Following · 54 Posts · Joined 24.11.2024

Latest posts by Craig Schmidt @craigschmidt.com

Gandalf the White. A quote for our times.

14.01.2026 01:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

The red cups are a brand called Solo cups. They have always been red.

01.01.2026 21:00 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I’m at @colmweb.org this week in Montreal. Come see our BoundlessBPE paper in the Wed morning poster session. Love to talk to anyone else here, especially about tokenization. #COLM2025

07.10.2025 19:24 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I believe he’s talking about Olin College of Engineering. Created from scratch as an undergraduate-only school, with its first class in 2002. Kind of a Harvey Mudd of the east. The campus is near me, and they seem to attract great students.

02.10.2025 21:34 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
WordPiece can't always avoid <unk> even with ByteLevel pretokenization. Β· Issue #1863 Β· huggingface/tokenizers The ByteLevel pre-tokenizer is largely used to avoid the possibility of an <unk> token. However, there is a problem with the continuation characters in WordPiece that prevents you from adding all o...

The other is that there isn't a way to specify an initial vocabulary with all 256 bytes including the continuation-prefixed (##) forms. See github.com/huggingface/.... So, in short, if you use their WordPiece you might get <UNK> tokens.

18.09.2025 15:42 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem · Stéphan Tulkens' Blog

There are two different ways that the Hugging Face WordPiece implementation can produce <UNK> tokens even with ByteLevel pretokenization. A nice blog post from Stéphan Tulkens shows how to fix one of them, in response to a question of mine.
stephantul.github.io/blog/better-...

18.09.2025 15:42 πŸ‘ 2 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0
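To illustrate the failure mode being discussed, here's a minimal, self-contained sketch of greedy longest-match WordPiece segmentation (not the Hugging Face implementation; the toy vocabulary is hypothetical): a character whose continuation form (the ##-prefixed piece) is missing from the vocabulary sinks the whole word to [UNK], even though the same character is present as an initial piece.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece; non-initial pieces carry '##'."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation position needs the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # one unmatchable position sinks the whole word
        tokens.append(piece)
        start = end
    return tokens

# "b" is in the vocab as an initial piece, but "##b" is not,
# so "ab" cannot be segmented and the whole word becomes [UNK].
print(wordpiece("ab", {"a", "b", "##a"}))  # ['[UNK]']
print(wordpiece("aa", {"a", "b", "##a"}))  # ['a', '##a']
```

This is why covering all 256 bytes as initial pieces is not enough: the ##-prefixed byte forms must be in the vocabulary too, and (as noted above) there isn't a way to force that in the Hugging Face trainer.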

I've been using GPT-5 on my phone (since it isn't on my web account yet). I've had several bad responses with logical inconsistencies. My hot take: what if GPT-5 is mostly about saving OpenAI money on inference? That would explain why they are deprecating all the other models so quickly.

10.08.2025 18:12 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

@crampell.bsky.social’s post got me to thinking and…yes…Trump has apparently canceled the research grant of Judea Pearl, who is one of the world’s leading scholars, is Jewish, Israeli-American, & is vocally opposed to antisemitism, & is the father of Daniel Pearl.
www.science.org/content/arti...

03.08.2025 02:44 πŸ‘ 209 πŸ” 91 πŸ’¬ 8 πŸ“Œ 8
Job vacancies (OBP) · Georg-August-Universität Göttingen

Interested in multilingual tokenization in #NLP? Lisa Beinborn and I are hiring!

PhD candidate position in GΓΆttingen, Germany: www.uni-goettingen.de/de/644546.ht...

PostDoc position in Leuven, Belgium:
www.kuleuven.be/personeel/jo...

Deadline 6th of June

16.05.2025 08:23 πŸ‘ 25 πŸ” 13 πŸ’¬ 2 πŸ“Œ 2

I've posted a few papers I missed including yours here bsky.app/profile/crai.... Thomas pointed that out about 5 seconds after I posted on the discord :-)

30.07.2025 15:17 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Causal Estimation of Tokenisation Bias Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

16) Causal Estimation of Tokenisation Bias
Pietro Lesci et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:22 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Tokenisation is NP-Complete Philip Whittington, Gregor Bachmann, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

15) Tokenisation is NP-Complete
Philip Whittington et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:22 πŸ‘ 3 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model Thomas Bauwens, David KaczΓ©r, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

14) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:22 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 1

And of course I missed some tokenization related papers at #ACL2025 in my previous post. Any more I should add?

30.07.2025 14:22 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages Georgy Andryushchenko, Vladimir V. Ivanov. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 2025.

13) Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages
Georgy Andryushchenko et al
aclanthology.org/2025.acl-srw...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Retrofitting Large Language Models with Dynamic Tokenization Darius Feher, Ivan Vulić, Benjamin Minixhofer. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

12) Retrofitting Large Language Models with Dynamic Tokenization
Darius Feher et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
TokAlign: Efficient Vocabulary Adaptation via Token Alignment Chong Li, Jiajun Zhang, Chengqing Zong. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

11) TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Chong Li et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

10) Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025.

9) Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Andrew Gambardella et al
aclanthology.org/2025.acl-sho...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Adversarial Tokenization Renato Geh, Zilei Shao, Guy Van Den Broeck. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

8) Adversarial Tokenization
Renato Lui Geh et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Incorporating Domain Knowledge into Materials Tokenization Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

7) Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Beyond Text Compression: Evaluating Tokenizers Across Scales Jonas F. Lotz, AntΓ³nio V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

6) Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...

5) Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Zhu Xu et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Unsupervised Morphological Tree Tokenizer Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

4) Unsupervised Morphological Tree Tokenizer
Xiang Hu et al
aclanthology.org/2025.finding...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Splintering Nonconcatenative Languages for Better Tokenization Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

3) Splintering Nonconcatenative Languages for Better Tokenization
Yuval Pinter et al
aclanthology.org/2025.finding...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Tokenization is Sensitive to Language Variation Anna Wegmann, Dong Nguyen, David Jurgens. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

2) Tokenization is Sensitive to Language Variation
Anna Wegmann et al
aclanthology.org/2025.finding...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Byte Latent Transformer: Patches Scale Better Than Tokens Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini...

1) Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I'm sadly not at #ACL2025, but the work on tokenization seems to continue to explode. Here are the tokenization-related papers I could find, in no particular order. Let me know if I missed any.

30.07.2025 14:03 πŸ‘ 11 πŸ” 4 πŸ’¬ 2 πŸ“Œ 0

Really grateful to the organizers for the recognition of our work!

19.07.2025 13:55 πŸ‘ 12 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
ICML Poster · Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning · ICML 2025

You’re right that these results apply to general “big” datasets like The Pile or RedPajama. There are several papers at ICML on weighting datasets, like Chameleon (icml.cc/virtual/2025...), that could probably let you get away with less data.

17.07.2025 15:29 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0