
David Smith

@dasmiq

Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.

5,352 Followers
299 Following
393 Posts
Joined 01.09.2023

Latest posts by David Smith @dasmiq

LLM as Critic as Artist

06.03.2026 12:12 👍 2 🔁 0 💬 0 📌 0
When it Rains, it Pours: Modeling Media Storms and the News Ecosystem – Benjamin Litterer, David Jurgens, Dallas Card. Findings of the Association for Computational Linguistics: EMNLP 2023.

If you've made it this far, you might also want to check out Amber's earlier work on media storms: www.tandfonline.com/doi/abs/10.1..., or my student Ben Litterer's (@blitt.bsky.social) ACL paper on the same topic: aclanthology.org/2023.finding...

22.02.2026 18:00 👍 4 🔁 1 💬 1 📌 0
Catching Fire Appendix – Cambridge Political Communication Element – Appendix for Catching Fire in the News

For additional details, including coding protocols, teaching resources, and side-by-side case comparisons, you can refer to the accompanying website: www.amber-boydstun.com/catching-fir...

22.02.2026 17:59 👍 4 🔁 1 💬 1 📌 0

We also discuss additional factors that can influence the course of a storm, such as journalistic gatekeeping, attention fatigue, political activism, and strategic communication online.
For a more in-depth summary, please take a look at Jill's thread here: bsky.app/profile/jill... or read the book!

22.02.2026 17:58 👍 2 🔁 1 💬 1 📌 0

The book is built around a series of paired case studies -- similar events where one became a full-fledged media storm and the other did not -- such as the Titan Submersible Implosion vs. the Messenia Migrant Boat Disaster, which occurred just days apart in 2023.

22.02.2026 17:58 👍 3 🔁 2 💬 1 📌 0

The heart of this work uses the fire triangle model (heat, fuel, and oxygen) as a metaphor to characterize the necessary conditions for an event to turn into a media storm -- those stories that are so pervasive in the news that they are practically inescapable.

22.02.2026 17:57 👍 5 🔁 1 💬 1 📌 0
Catching Fire in the News – Cambridge Core: Politics: General Interest

I'm a little late in sharing this news, but thanks to the extraordinary efforts of Amber Boydstun, @jilllaufer.bsky.social, and @nlpnoah.bsky.social, our book on media storms, "Catching Fire in the News", is now published and available fully open-access from Cambridge! doi.org/10.1017/9781...

22.02.2026 17:54 👍 29 🔁 8 💬 2 📌 0
Memorization vs. generalization in deep learning: implicit biases, benign overfitting, and more. Or: how I learned to stop worrying and love the memorization

What is the relationship between memorization and generalization in AI? Is there a fundamental tradeoff? In infinitefaculty.substack.com/p/memorizati... I’ve reviewed some of the evolving perspectives on memorization & generalization in machine learning, from classic perspectives through LLMs.

18.02.2026 15:54 👍 133 🔁 27 💬 4 📌 5
Re-OCR Your Digitised Collections for ~$0.002/Page – Daniel van Strien. A guide to re-processing digitised collections with open-source VLM-based OCR models.

Wrote a slightly more detailed guide on how to do this with your own collections/materials: danielvanstrien.xyz/posts/2026/r...
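
The core loop is small. A minimal sketch, assuming an open VLM served behind an OpenAI-compatible endpoint (e.g. with vLLM); the model id and prompt are placeholders, and the linked guide covers the real details:

```python
# Hedged sketch: re-OCR one page image with an open VLM behind an
# OpenAI-compatible server. Model id and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def reocr_page(path: str) -> str:
    # Encode the page image so it can travel in the chat request.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="open-ocr-vlm",  # placeholder: whichever OCR VLM you serve
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page exactly."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# print(reocr_page("page_001.png"))
```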

19.02.2026 14:28 👍 28 🔁 6 💬 4 📌 1
Toward an Ontological Representation of Fictional Characters | Computational Humanities Research | Cambridge Core

New article! "Toward an Ontological Representation of Fictional Characters" by @antoine-bourgois.bsky.social, me, @oseminck.bsky.social & @tpoibeau.bsky.social

doi.org/10.1017/chr....

Nothing fancy here — only sweat & tears. 🧵

20.02.2026 13:35 👍 21 🔁 7 💬 1 📌 0
Symmetry in language statistics shapes the geometry of model representations. Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM rep...

In our new preprint, we explain how some salient features of representational geometry in language modeling originate from a single principle - translation symmetry in the statistics of data.

arxiv.org/abs/2602.150...

With Dhruva Karkada, Daniel Korchinski, Andres Nava, & Matthieu Wyart.
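
On my reading (an illustration, not the paper's code), "translation symmetry in the statistics of data" can be glossed as stationarity: n-gram statistics that do not depend on absolute position. A toy check:

```python
# Toy stationarity check: under translation symmetry, the empirical bigram
# distribution at position t should not depend on t.
from collections import Counter

def bigram_dist(docs, pos):
    """Empirical bigram distribution at absolute position `pos`."""
    counts = Counter((d[pos], d[pos + 1]) for d in docs if len(d) > pos + 1)
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()} if total else {}

def total_variation(p, q):
    """TV distance in [0, 1]; near 0 means the positions share statistics."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# With `docs` a list of token sequences from a large corpus, this distance
# should stay small for any pair of positions:
# total_variation(bigram_dist(docs, 5), bigram_dist(docs, 50))
```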

19.02.2026 04:20 👍 37 🔁 8 💬 1 📌 0
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training – YouTube video by Google TechTalks

I gave a talk at the Google Privacy in ML Seminar last summer on privacy & memorization: "Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training".

It's up on YouTube now if you're interested :)
youtu.be/IzIsHFCqXGo?...

18.02.2026 02:05 👍 2 🔁 2 💬 0 📌 0

We also show that we are far from done, especially for a complicated language like Old French.

But we
(1) define the issue,
(2) propose a first solution that enables pre-annotation of larger datasets, and
(3) offer an alternative to less trustworthy models that go beyond ATR.

17.02.2026 18:11 👍 3 🔁 1 💬 0 📌 0
Pre-Editorial Normalization – a Hugging Face Space by comma-project – Latin and Old French normalization of CATMuS output

We release:

📚 4.66M silver training samples
🧪 1.8k gold evaluation set huggingface.co/datasets/com...
🤖 ByT5-based model → 6.7% CER huggingface.co/comma-projec...

Try it here 👇
huggingface.co/spaces/comma...
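
To run the model locally rather than in the Space, here is a minimal sketch assuming the standard transformers seq2seq API; the model id below is hypothetical, since the link above is truncated:

```python
# Hedged sketch: byte-level seq2seq normalization of raw ATR output.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "comma-project/pen-byt5"  # hypothetical id; use the repo linked above
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

atr = "omnium peccatorum quia ex quo dyaconus"  # raw graphemic ATR output
ids = tok(atr, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))  # normalized text
```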

17.02.2026 18:11 👍 4 🔁 1 💬 1 📌 0

👉 We propose Pre-Editorial Normalization (PEN):

An intermediate layer between:
📝 graphemic ATR output
📖 fully edited text

Goal: preserve palaeographic fidelity + enable usability.
We keep two layers, the ATR output and the normalization, with aligned tokens to go back to the source.
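
As a concrete sketch (my illustration, not the paper's schema), the two layers can live on one token; the normalizations below are toy examples:

```python
# Hedged sketch of the two-layer idea: every token keeps its graphemic ATR
# form and its normalized form, so normalized text maps back to the source.
from dataclasses import dataclass

@dataclass
class AlignedToken:
    atr: str   # graphemic ATR output, palaeographically faithful
    norm: str  # pre-editorial normalized form

# Toy line; normalizations are illustrative only.
line = [AlignedToken("dyaconus", "diaconus"), AlignedToken("stult9", "stultus")]
print(" ".join(t.norm for t in line))  # usable, searchable text
print(" ".join(t.atr for t in line))   # straight back to the source
```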

17.02.2026 18:11 👍 2 🔁 1 💬 1 📌 0

Recent ATR progressβ€”especially with palaeographic datasets like CATMuSβ€”has improved access to medieval sources.

But:
❌ Raw outputs are hard to use
❌ Fully normalized models over-normalize & hallucinate

There’s a methodological gap.

17.02.2026 18:11 👍 2 🔁 1 💬 1 📌 0

If I give you the text
📚 omnium peccatorum quia ex quo dyaconus quando esset in futurum, stultus esset

Can you find the ATR error without the manuscript?

Probably not.

ATR models that predict text and normalize in one go generate fluent, trustworthy-looking text, but they prevent you from detecting such issues.

17.02.2026 18:11 👍 1 🔁 1 💬 2 📌 0

📄 New paper:
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

Thibault ClΓ©rice, @rachelbawden.bsky.social , Anthony Glaise, Ariane Pinche, @dasmiq.bsky.social (2026) arxiv.org/abs/2602.13905

We introduce Pre-Editorial Normalization (PEN).

🧵⬇️

17.02.2026 18:11 👍 23 🔁 9 💬 1 📌 2

Excited to be co-organizing the #CHI2026 workshop on augmented reading interfaces 📚✨ Submissions are open for one more week! We want to know what you're working on!

06.02.2026 20:21 👍 10 🔁 2 💬 1 📌 0

our open model proving out specialized RAG LMs over scientific literature has been published in Nature ✌🏻

congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers

www.nature.com/articles/s41...

04.02.2026 22:43 👍 44 🔁 10 💬 2 📌 2

Cool postdoc job opportunity! A chance to work with some great English & comp sci scholars at Carnegie Mellon. I appreciate that this ad stresses the chance to do interesting technical work, work on an interesting humanities problem, and publish in both humanities & comp sci venues. Looks great, apply!

30.01.2026 18:01 👍 5 🔁 4 💬 0 📌 0

The bias/variance tradeoff in 2026: Claude Sonnet wrote a program to solve the problem as described; Claude Opus figured out a shortcut from the example data that won't generalize.

28.01.2026 19:59 👍 4 🔁 0 💬 0 📌 0

[Job 📣] Are you curious about #AI applications in the #humanities? My Print and Probability research group (@print-and-prob.bsky.social) is hiring a postdoc! Come help us develop computational methods for identifying clandestine early modern printers!

cmu.wd5.myworkdayjobs.com/CMU/job/Pitt...

22.01.2026 22:00 👍 15 🔁 13 💬 0 📌 1
Debate me bro – Statistical inference is rhetoric.

Statistical inference is a rhetoric of counts.

20.01.2026 15:49 👍 19 🔁 3 💬 1 📌 3

I had the absolute pleasure to visit @craicexeter.bsky.social, where I laid out an argument for how critical & computational scholars should lead the conversation on AI. We need to expand research on harms, interrogate corporate hype, and support people’s critical understanding of these technologies.

22.01.2026 16:32 👍 18 🔁 3 💬 0 📌 3

I couldn't figure out eloquent language to describe digital agents that assist with web navigation tasks, so I just wrote "click click" and if I keep this up maybe I will start referring to language generation as "word word"

22.01.2026 17:27 👍 11 🔁 1 💬 1 📌 0
Hi Honey, I’m Homo Neuricus – Six Ways I'm using AI to Become More Human

A very random view into how some people* outside of tech think about and use chatbots. It's not coding, that's for sure, and some of it might sound ridiculous, but I think this kind of perspective and usage is way more common than we might assume.

*LA people (sorry, I love LA, but this is very LA)

22.01.2026 17:11 👍 7 🔁 1 💬 2 📌 0

Let's think step by step. If you could reconstruct the original page with high probability using a language model given the bag of words, you could:
1. demonstrate that bag of words models are useful, and
2. destroy the legal arguments people used to allow them to share bags of words.
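
A toy version of that reconstruction step, assuming greedy ordering scored by GPT-2 (a real attempt would need beam search and a far stronger model; the bag here is made up):

```python
# Hedged sketch: greedily reorder a bag of words by language-model score.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def score(text: str) -> float:
    """Mean log-likelihood of `text` under the LM (higher is better)."""
    # Prepend BOS so even a single word has a scorable target token.
    ids = tok(tok.bos_token + text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -lm(ids, labels=ids).loss.item()

def reconstruct(bag: list[str]) -> str:
    bag, out = list(bag), []
    while bag:
        # Append whichever remaining word the LM likes best next.
        best = max(bag, key=lambda w: score(" ".join(out + [w])))
        out.append(best)
        bag.remove(best)
    return " ".join(out)

print(reconstruct(["pours", "rains", "when", "it", "it"]))
```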

20.01.2026 20:28 👍 2 🔁 0 💬 0 📌 0
CSE 598-004 - Building Small Language Models

The second new class I'm teaching is a very experimental graduate level seminar in CSE: "Building Small Language Models". I taught the grad level NLP class last semester (so fun!) but students wanted more—which of these new ideas work, and which work for SLMs? jurgens.people.si.umich.edu/CSE598-004/

19.01.2026 21:29 👍 32 🔁 9 💬 2 📌 1

For social scientists interested in LLMs for text classification/coding, the process here is potentially very helpful (even if you don't use the product itself).

Their core technique: Contradictory Example Training
Their training method: Binocular Labeling

More details in the linked post below.

15.01.2026 19:37 👍 45 🔁 13 💬 0 📌 0