
David Smith

@dasmiq

Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.

5,352 Followers
299 Following
393 Posts
Joined 01.09.2023

Latest posts by David Smith @dasmiq

LLM as Critic as Artist

06.03.2026 12:12 👍 2 🔁 0 💬 0 📌 0
When it Rains, it Pours: Modeling Media Storms and the News Ecosystem – Benjamin Litterer, David Jurgens, Dallas Card. Findings of the Association for Computational Linguistics: EMNLP 2023.

If you've made it this far, you might also want to check out Amber's earlier work on media storms: www.tandfonline.com/doi/abs/10.1..., or my student Ben Litterer's (@blitt.bsky.social) ACL paper on the same topic: aclanthology.org/2023.finding...

22.02.2026 18:00 👍 4 🔁 1 💬 1 📌 0
Catching Fire Appendix – Cambridge Political Communication Element – Appendix for Catching Fire in the News

For additional details, including coding protocols, teaching resources, and side-by-side case comparisons, you can refer to the accompanying website: www.amber-boydstun.com/catching-fir...

22.02.2026 17:59 👍 4 🔁 1 💬 1 📌 0

We also discuss additional factors that can influence the course of a storm, such as journalistic gatekeeping, attention fatigue, political activism, and strategic communication online.
For a more in-depth summary, please take a look at Jill's thread here: bsky.app/profile/jill... or read the book!

22.02.2026 17:58 👍 2 🔁 1 💬 1 📌 0

The book is built around a series of paired case studies -- similar events where one became a full-fledged media storm and the other did not -- such as the Titan Submersible Implosion vs. the Messenia Migrant Boat Disaster, which occurred just days apart in 2023.

22.02.2026 17:58 👍 3 🔁 2 💬 1 📌 0

The heart of this work uses the fire triangle model (heat, fuel, and oxygen) as a metaphor to characterize the necessary conditions for an event to turn into a media storm -- those stories that are so pervasive in the news that they are practically inescapable.

22.02.2026 17:57 👍 5 🔁 1 💬 1 📌 0
Catching Fire in the News – Cambridge Core: Politics: General Interest

I'm a little late in sharing this news, but thanks to the extraordinary efforts of Amber Boydstun, @jilllaufer.bsky.social, and @nlpnoah.bsky.social, our book on media storms, "Catching Fire in the News", is now published and available fully open-access from Cambridge! doi.org/10.1017/9781...

22.02.2026 17:54 👍 29 🔁 8 💬 2 📌 0
Memorization vs. generalization in deep learning: implicit biases, benign overfitting, and more. Or: how I learned to stop worrying and love the memorization

What is the relationship between memorization and generalization in AI? Is there a fundamental tradeoff? In infinitefaculty.substack.com/p/memorizati... I’ve reviewed some of the evolving perspectives on memorization & generalization in machine learning, from classic perspectives through LLMs.

18.02.2026 15:54 👍 133 🔁 27 💬 4 📌 5
Re-OCR Your Digitised Collections for ~$0.002/Page – Daniel van Strien. A guide to re-processing digitised collections with open-source VLM-based OCR models.

Wrote a slightly more detailed guide on how to do this with your own collections/materials: danielvanstrien.xyz/posts/2026/r...
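
The core loop is small. A minimal sketch, assuming an open VLM served behind an OpenAI-compatible endpoint (e.g. with vLLM); the model id and prompt are placeholders, and the linked guide covers the real details:

```python
# Hedged sketch: re-OCR one page image with an open VLM behind an
# OpenAI-compatible server. Model id and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def reocr_page(path: str) -> str:
    # Encode the page image so it can travel in the chat request.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="open-ocr-vlm",  # placeholder: whichever OCR VLM you serve
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page exactly."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# print(reocr_page("page_001.png"))
```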

19.02.2026 14:28 👍 28 🔁 6 💬 4 📌 1
Toward an Ontological Representation of Fictional Characters | Computational Humanities Research | Cambridge Core

New article! "Toward an Ontological Representation of Fictional Characters" by @antoine-bourgois.bsky.social, me, @oseminck.bsky.social & @tpoibeau.bsky.social

doi.org/10.1017/chr....

Nothing fancy here — only sweat & tears. 🧵

20.02.2026 13:35 👍 21 🔁 7 💬 1 📌 0
Symmetry in language statistics shapes the geometry of model representations. Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM rep...

In our new preprint, we explain how some salient features of representational geometry in language modeling originate from a single principle - translation symmetry in the statistics of data.

arxiv.org/abs/2602.150...

With Dhruva Karkada, Daniel Korchinski, Andres Nava, & Matthieu Wyart.
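
On my reading (an illustration, not the paper's code), "translation symmetry in the statistics of data" can be glossed as stationarity: n-gram statistics that do not depend on absolute position. A toy check:

```python
# Toy stationarity check: under translation symmetry, the empirical bigram
# distribution at position t should not depend on t.
from collections import Counter

def bigram_dist(docs, pos):
    """Empirical bigram distribution at absolute position `pos`."""
    counts = Counter((d[pos], d[pos + 1]) for d in docs if len(d) > pos + 1)
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()} if total else {}

def total_variation(p, q):
    """TV distance in [0, 1]; near 0 means the positions share statistics."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# With `docs` a list of token sequences from a large corpus, this distance
# should stay small for any pair of positions:
# total_variation(bigram_dist(docs, 5), bigram_dist(docs, 50))
```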

19.02.2026 04:20 👍 37 🔁 8 💬 1 📌 0
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training – YouTube video by Google TechTalks

I gave a talk at the Google Privacy in ML Seminar last summer on privacy & memorization: "Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training".

It's up on YouTube now if you're interested :)
youtu.be/IzIsHFCqXGo?...

18.02.2026 02:05 👍 2 🔁 2 💬 0 📌 0

We also show that we are far from done, especially for a complicated language like Old French.

But we
(1) define the issue,
(2) propose a first solution that enables pre-annotation of larger datasets, and
(3) offer an alternative to less trustworthy models that go beyond ATR.

17.02.2026 18:11 👍 3 🔁 1 💬 0 📌 0
Pre-Editorial Normalization – a Hugging Face Space by comma-project – Latin and Old French normalization of CATMuS output

We release:

📚 4.66M silver training samples
🧪 1.8k gold evaluation set huggingface.co/datasets/com...
🤖 ByT5-based model → 6.7% CER huggingface.co/comma-projec...

Try it here 👇
huggingface.co/spaces/comma...
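
To run the model locally rather than in the Space, here is a minimal sketch assuming the standard transformers seq2seq API; the model id below is hypothetical, since the link above is truncated:

```python
# Hedged sketch: byte-level seq2seq normalization of raw ATR output.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "comma-project/pen-byt5"  # hypothetical id; use the repo linked above
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

atr = "omnium peccatorum quia ex quo dyaconus"  # raw graphemic ATR output
ids = tok(atr, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))  # normalized text
```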

17.02.2026 18:11 👍 4 🔁 1 💬 1 📌 0

👉 We propose Pre-Editorial Normalization (PEN):

An intermediate layer between:
📝 graphemic ATR output
📖 fully edited text

Goal: preserve palaeographic fidelity + enable usability.
We keep two layers, the ATR output and the normalization, with aligned tokens to go back to the source.
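
As a concrete sketch (my illustration, not the paper's schema), the two layers can live on one token; the normalizations below are toy examples:

```python
# Hedged sketch of the two-layer idea: every token keeps its graphemic ATR
# form and its normalized form, so normalized text maps back to the source.
from dataclasses import dataclass

@dataclass
class AlignedToken:
    atr: str   # graphemic ATR output, palaeographically faithful
    norm: str  # pre-editorial normalized form

# Toy line; normalizations are illustrative only.
line = [AlignedToken("dyaconus", "diaconus"), AlignedToken("stult9", "stultus")]
print(" ".join(t.norm for t in line))  # usable, searchable text
print(" ".join(t.atr for t in line))   # straight back to the source
```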

17.02.2026 18:11 👍 2 🔁 1 💬 1 📌 0

Recent ATR progressβ€”especially with palaeographic datasets like CATMuSβ€”has improved access to medieval sources.

But:
❌ Raw outputs are hard to use
❌ Fully normalized models over-normalize & hallucinate

There’s a methodological gap.

17.02.2026 18:11 👍 2 🔁 1 💬 1 📌 0

If I give you the text
📚 omnium peccatorum quia ex quo dyaconus quando esset in futurum, stultus esset

Can you find the ATR error without the manuscript?

Probably not.

ATR models that predict text and normalize in one go generate fluent, trustworthy-looking text, but they prevent you from detecting such issues.

17.02.2026 18:11 👍 1 🔁 1 💬 2 📌 0

📄 New paper:
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

Thibault ClΓ©rice, @rachelbawden.bsky.social , Anthony Glaise, Ariane Pinche, @dasmiq.bsky.social (2026) arxiv.org/abs/2602.13905

We introduce Pre-Editorial Normalization (PEN).

🧵⬇️

17.02.2026 18:11 👍 23 🔁 9 💬 1 📌 2

Excited to be co-organizing the #CHI2026 workshop on augmented reading interfaces 📚✨ Submissions are open for one more week! We want to know what you're working on!

06.02.2026 20:21 👍 10 🔁 2 💬 1 📌 0

our open model proving out specialized RAG LMs over scientific literature has been published in Nature ✌🏻

congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers

www.nature.com/articles/s41...

04.02.2026 22:43 👍 44 🔁 10 💬 2 📌 2

Cool postdoc job opportunity! A chance to work with some great English & comp sci scholars at Carnegie Mellon. I appreciate that this ad stresses the chance to do interesting technical work, work on an interesting humanities problem, and publish in both humanities & comp sci venues. Looks great, apply!

30.01.2026 18:01 👍 5 🔁 4 💬 0 📌 0

The bias/variance tradeoff in 2026: Claude Sonnet wrote a program to solve the problem as described; Claude Opus figured out a shortcut from the example data that won't generalize.

28.01.2026 19:59 👍 4 🔁 0 💬 0 📌 0

[Job 📣] Are you curious about #AI applications in the #humanities? My Print and Probability research group (@print-and-prob.bsky.social) is hiring a postdoc! Come help us develop computational methods for identifying clandestine early modern printers!

cmu.wd5.myworkdayjobs.com/CMU/job/Pitt...

22.01.2026 22:00 👍 15 🔁 13 💬 0 📌 1
Debate me bro – Statistical inference is rhetoric.

Statistical inference is a rhetoric of counts.

20.01.2026 15:49 👍 19 🔁 3 💬 1 📌 3

I had the absolute pleasure to visit @craicexeter.bsky.social, where I laid out an argument for how critical & computational scholars should lead the conversation on AI. We need to expand research on harms, interrogate corporate hype, and support people’s critical understanding of these technologies.

22.01.2026 16:32 👍 18 🔁 3 💬 0 📌 3

I couldn't figure out eloquent language to describe digital agents that assist with web navigation tasks, so I just wrote "click click" and if I keep this up maybe I will start referring to language generation as "word word"

22.01.2026 17:27 👍 11 🔁 1 💬 1 📌 0
Hi Honey, I’m Homo Neuricus – Six Ways I'm using AI to Become More Human

A very random view into how some people* outside of tech think about and use chatbots. It's not coding, that's for sure, and some of it might sound ridiculous, but I think this kind of perspective and usage is way more common than we might assume.

*LA people (sorry, I love LA, but this is very LA)

22.01.2026 17:11 👍 7 🔁 1 💬 2 📌 0

Let's think step by step. If you could reconstruct the original page with high probability using a language model given the bag of words, you could:
1. demonstrate that bag of words models are useful, and
2. destroy the legal arguments people used to allow them to share bags of words.
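
A toy version of that reconstruction step, assuming greedy ordering scored by GPT-2 (a real attempt would need beam search and a far stronger model; the bag here is made up):

```python
# Hedged sketch: greedily reorder a bag of words by language-model score.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def score(text: str) -> float:
    """Mean log-likelihood of `text` under the LM (higher is better)."""
    # Prepend BOS so even a single word has a scorable target token.
    ids = tok(tok.bos_token + text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -lm(ids, labels=ids).loss.item()

def reconstruct(bag: list[str]) -> str:
    bag, out = list(bag), []
    while bag:
        # Append whichever remaining word the LM likes best next.
        best = max(bag, key=lambda w: score(" ".join(out + [w])))
        out.append(best)
        bag.remove(best)
    return " ".join(out)

print(reconstruct(["pours", "rains", "when", "it", "it"]))
```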

20.01.2026 20:28 👍 2 🔁 0 💬 0 📌 0
CSE 598-004 - Building Small Language Models

The second new class I'm teaching is a very experimental graduate level seminar in CSE: "Building Small Language Models". I taught the grad level NLP class last semester (so fun!) but students wanted more—which of these new ideas work, and which work for SLMs? jurgens.people.si.umich.edu/CSE598-004/

19.01.2026 21:29 👍 32 🔁 9 💬 2 📌 1

For social scientists interested in LLMs for text classification/coding, the process here is potentially very helpful (even if you don't use the product itself).

Their core technique: Contradictory Example Training
Their training method: Binocular Labeling

More details in the linked post below.

15.01.2026 19:37 👍 45 🔁 13 💬 0 📌 0