
Arkadiy Saakyan

@asaakyan

PhD student at Columbia University working on human-AI collaboration, AI creativity and explainability. prev. intern @GoogleDeepMind, @AmazonScience asaakyan.github.io

427 Followers · 216 Following · 16 Posts · Joined 11.11.2024

Latest posts by Arkadiy Saakyan @asaakyan


Our paper “Inferring fine-grained migration patterns across the United States” is now out in @natcomms.nature.com! We released a new, highly granular migration dataset. 1/9

05.02.2026 17:30 👍 70 🔁 27 💬 2 📌 5

Merriam-Webster’s human editors have chosen ‘slop’ as the 2025 Word of the Year.

15.12.2025 14:07 👍 24052 🔁 7282 💬 360 📌 940

Day 7 of #30DayMapChallenge asked us to think about accessibility. @gsagostini.bsky.social considers two metrics of access simultaneously: distance to a Subway and distance to the subway.

13.11.2025 14:27 👍 101 🔁 18 💬 4 📌 3

We didn’t observe this negative relationship between n-gram novelty and pragmaticality in humans, only in open-source LLMs!

09.11.2025 18:55 👍 1 🔁 0 💬 1 📌 0
Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity...

See more details in the paper!
Work with my amazing mentors & collaborators @najoung.bsky.social, @tuhinchakr.bsky.social, Smaranda Muresan
Paper link: www.arxiv.org/abs/2509.22641
Github link: github.com/asaakyan/ngr...

04.11.2025 15:08 👍 3 🔁 0 💬 0 📌 0

On the OOD dataset StyleMirror, we find that LLM-judge novelty scores are more strongly associated with expert preferences than a previously proposed n-gram novelty metric, the Creativity Index, suggesting our operationalization yields a better-aligned metric for textual creativity.

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0

Writing-quality reward model scores are associated with both creativity and pragmaticality judgements, but are not interpretable. LLM judges can replicate some expert novelty judgements but struggle to identify non-pragmatic expressions.

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0

In a follow-up study with GPT-5 and Claude, we observe that the rate of human-judged creative expressions in AI-written text is significantly lower than in human-written text.

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0

Further, we find that both open-source model families tested (OLMo 1 and OLMo 2, at 7B and 32B sizes) exhibit a negative relationship between n-gram novelty and pragmaticality. As open-source LLMs try to generate text not present in their training data, their expressions tend to make less sense in context.

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 1

N-gram novelty is not a reliable metric of creativity: over *90%* of top-quartile n-gram novelty expressions were not judged as creative. We find many examples of low n-gram novelty expressions rated creative and high n-gram novelty expressions rated as non-pragmatic.

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0

We recruit expert writers with MFA/MA/PhD backgrounds. They rated expressions in human- and AI-generated passages (the latter from fully open-source OLMo models, with public code AND data) for whether they make sense, are pragmatic, and are novel; they could also highlight any creative expressions.

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0

The standard definition of creativity states that the product has to be both novel AND appropriate. Similarly, we operationalize textual creativity as human-judged expression novelty AND sensicality (making sense by itself) plus pragmaticality (making sense in context).

04.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0
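The definition above can be sketched as a simple conjunction of human judgments. This is an illustrative toy, not the paper's actual annotation schema; the field and function names are my own.

```python
from dataclasses import dataclass


@dataclass
class ExpressionJudgment:
    """Hypothetical per-expression human annotation."""
    novel: bool      # judged original / unexpected
    sensical: bool   # makes sense by itself
    pragmatic: bool  # makes sense in its surrounding context


def is_creative(j: ExpressionJudgment) -> bool:
    # Creativity = novelty AND appropriateness,
    # where appropriateness = sensicality + pragmaticality.
    return j.novel and j.sensical and j.pragmatic
```

A highly novel expression that fails either appropriateness check is not counted as creative under this operationalization.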
N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.


N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.

04.11.2025 15:08 👍 41 🔁 10 💬 1 📌 2
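For intuition, n-gram novelty can be computed as the fraction of a text's n-grams that never appear in a reference corpus. This is a minimal toy sketch of that idea, not the paper's exact metric (metrics like the Creativity Index use large-scale corpus matching):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text, corpus, n=3):
    """Fraction of the text's n-grams unseen in the reference corpus."""
    corpus_ngrams = set(ngrams(corpus.split(), n))
    text_ngrams = ngrams(text.split(), n)
    if not text_ngrams:
        return 0.0
    unseen = sum(1 for g in text_ngrams if g not in corpus_ngrams)
    return unseen / len(text_ngrams)
```

The thread's point is that a high score here says nothing about whether the unseen n-grams actually make sense in context.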
Urban Data Science & Equitable Cities | EAAMO Bridges: a working group with biweekly talks, paper studies, and workshops on computational urban data analysis to explore and address inequities.

Are you a researcher using computational methods to understand cities?

@mfranchi.bsky.social @jennahgosciak.bsky.social and I organize an EAAMO Bridges working group on Urban Data Science and we are looking for new members!

Fill out the interest form on our page: urban-data-science-eaamo.github.io

03.09.2025 15:05 👍 8 🔁 8 💬 1 📌 1

📢 New paper: Applied interpretability 🤝 MT personalization!

We steer LLM generations to mimic human translator styles on literary novels in 7 languages. 📚

SAE steering can beat few-shot prompting, leading to better personalization while maintaining quality.

🧵1/

23.05.2025 12:23 👍 20 🔁 5 💬 2 📌 2
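The general recipe behind SAE steering can be caricatured in a few lines: pick a sparse-autoencoder feature associated with the target style, then add its decoder direction, scaled by a strength coefficient, to the model's hidden states during generation. This is my heavily simplified assumption of the approach, not this paper's implementation:

```python
import numpy as np


def steer_hidden_state(hidden: np.ndarray,
                       decoder_direction: np.ndarray,
                       strength: float = 4.0) -> np.ndarray:
    """Shift a hidden state along a normalized SAE feature direction.

    In a real setup this would be applied via a forward hook at a
    chosen layer of the LLM; here it is just the vector arithmetic.
    """
    direction = decoder_direction / np.linalg.norm(decoder_direction)
    return hidden + strength * direction
```

The strength coefficient trades off how strongly the target style is imposed against degradation of generation quality.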
Understanding Figurative Meaning through Explainable Visual Entailment Large Vision-Language Models (VLMs) have demonstrated strong capabilities in tasks requiring a fine-grained understanding of literal meaning in images and text, such as visual question-answering or vi...

See more experiments and details in our paper: arxiv.org/abs/2405.01474

And come see our poster at NAACL :)
Joint work by Shreyas Kulkarni, @tuhinchakr.bsky.social, Smaranda Muresan

01.05.2025 16:30 👍 2 🔁 0 💬 0 📌 0

Even powerful models achieve only 50% explanation adequacy rate, suggesting difficulties in reasoning about figurative inputs. Hallucination & unsound reasoning are the most prominent error categories.

01.05.2025 16:30 👍 0 🔁 0 💬 1 📌 0

Our main results are:
1. VLMs struggle to generalize from literal to figurative meaning understanding (training on e-ViL only achieves random F1 on our task)
2. Figurative meaning in the image is harder to explain compared to when it is in the text
3. VLMs benefit from image data during fine-tuning

01.05.2025 16:30 👍 0 🔁 0 💬 1 📌 0

Via human-AI collaboration, we augment existing datasets for multimodal metaphors, sarcasm, and humor with entailed/contradicted captions and textual explanations. The figurative part can be in the image, the caption, or both. We benchmark a variety of models on the resulting data.

01.05.2025 16:30 👍 0 🔁 0 💬 1 📌 0

We frame the multimodal figurative meaning understanding problem as an explainable visual entailment task between an image (premise) and its caption (hypothesis). The VLM predicts whether the image entails or contradicts the caption, and shows the reasoning steps in a textual explanation.

01.05.2025 16:30 👍 0 🔁 0 💬 1 📌 0
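One instance of this task framing can be sketched as a small typed record; the field names here are illustrative, not the released dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class VisualEntailmentExample:
    """Hypothetical schema for one explainable visual entailment item."""
    image_path: str                                # the premise
    caption: str                                   # the hypothesis
    label: Literal["entailment", "contradiction"]  # gold relation
    explanation: str                               # reasoning steps in text
```

A model is then scored both on predicting the label and on producing an adequate explanation.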

Can vision-language models understand figurative meaning in multimodal inputs, like visual metaphors, sarcastic captions or memes? Come find out at our #NAACL2025 poster on Friday at 9am!

New task & dataset of images and captions with figurative phenomena like metaphor, idiom, sarcasm, and humor.

01.05.2025 16:30 👍 6 🔁 2 💬 1 📌 0

Migration data lets us study responses to environmental disasters, social change patterns, policy impacts, etc. But public data is too coarse, obscuring these important phenomena!

We build MIGRATE: a dataset of yearly flows between 47 billion pairs of US Census Block Groups. 1/5

28.03.2025 15:25 👍 42 🔁 18 💬 5 📌 1

People often claim they know when ChatGPT wrote something, but are they as accurate as they think?

It turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯

28.01.2025 14:55 👍 189 🔁 66 💬 10 📌 19