
Pratyush Maini

@pratyushmaini

Data Quality x Privacy PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi http://pratyushmaini.github.io/

299
Followers
209
Following
24
Posts
13.11.2024
Joined

Latest posts by Pratyush Maini @pratyushmaini

I have been thinking about data privacy, data curation for video models, finetuning vs. pretraining, how alignment data interacts with LLM safety, and its relation to unlearning. Also, very curious to hear about some of the most exciting problems folks in India are working on!

10.12.2024 22:59 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I’ll also be spending time at the @datologyai.com booth to talk about how we curated our way to the best LLM training dataset! Please DM if you would like to chat. The best part about being a researcher is to share the excitement of what we have been working on with each other.

10.12.2024 22:59 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Came to #NeurIPS2024 for the research news, but staying for these incredible views. I am presenting some recent works that (I think) significantly advance the discourse on LLM memorization, training data detection; & a study on hallucinations x model collapse in diffusion models.

10.12.2024 22:59 πŸ‘ 8 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

if you're a PhD student at CMU doing AI/ML, lmk if you want to be added to this starter pack.

(I don't belong in this list, but I don't know how to remove myself from this pack πŸ˜‚)

go.bsky.app/9APVxQQ

03.12.2024 18:27 πŸ‘ 14 πŸ” 3 πŸ’¬ 3 πŸ“Œ 0
Post image

πŸš€New Paper: Active Data Curation Effectively Distills Multimodal Models
arxiv.org/abs/2411.18674

Smol models are all the rage these days & knowledge distillation (KD) is key for model compression!

We show how data curation can effectively distill to yield SoTA FLOP-efficient {C/Sig}LIPs!!
πŸ§΅πŸ‘‡

02.12.2024 17:58 πŸ‘ 23 πŸ” 6 πŸ’¬ 1 πŸ“Œ 2
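The "curation as distillation" idea can be sketched with a common learnability-based selection heuristic: prioritize examples the student still gets wrong but a strong reference model handles well. This is an illustrative toy (the losses and the top-half cutoff are made up, not the paper's exact criterion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example losses for a student ("learner") and a strong
# pretrained reference model on a candidate batch of 8 examples.
learner_loss = rng.uniform(0.5, 3.0, size=8)
reference_loss = rng.uniform(0.2, 1.5, size=8)

# Learnability heuristic: rank examples by how much the student lags the
# reference, and keep the most "learnable" half of the batch.
learnability = learner_loss - reference_loss
selected = np.argsort(learnability)[-4:]
print(sorted(selected.tolist()))
```

Selecting on the loss *gap* (rather than raw student loss) avoids spending compute on examples that even the reference model finds hopeless, e.g. noisy captions.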

How to drive your research forward?

β€œI tested the idea we discussed last time. Here are some results. It does not work. (… awkward silence)”

Such conversations happen so many times in meetings with students. How do we move forward?

You need …

01.12.2024 22:09 πŸ‘ 90 πŸ” 18 πŸ’¬ 1 πŸ“Œ 1
Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators | Pratyush Maini A critical examination of the risks and challenges posed by private evaluators (for example ScaleAI) in the LLM landscape, highlighting financial incentives, conflicts of interest, and prevalence of e...

5/Check out the full blog here:
pratyushmaini.github.io/blog/2024/ri...

Very eager to hear more feedback on this new piece!

27.11.2024 19:05 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

4/We ended up simulating the bias as a company that "acts in good faith", & found that even in such a case, merely sharing an annotator pool (b/w curators and evaluators) can give the company's customers a 44-point Elo boost.... massive bragging rights in today's LLM landscape.

27.11.2024 19:05 πŸ‘ 2 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
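A 44-point gap maps onto a concrete head-to-head edge under the standard Elo expected-score formula. A quick sketch (the ratings below are made up for illustration):

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A 44-point rating advantage corresponds to a ~56.3% head-to-head win rate.
edge = elo_expected_score(1544.0, 1500.0)
print(round(edge, 3))  # → 0.563
```

So a 44-point boost from shared annotators alone would shift reported win rates by roughly 6 points over a fair 50/50 matchup.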
Post image

3/(Risk 2): The mere commonality in infra b/w data curators & evaluators can cause significant eval bias, even when they do not have ill-founded financial motives.

"Common infra" includes question templates, topics, styles, annotators, etc., with shared annotators being the lowest level of privileged access.

27.11.2024 19:05 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

2/Taking a closer look at SEAL: ScaleAI specializes in data curation for LLM trainers and has now begun establishing its own private evaluations. Two major concerns:

(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.

27.11.2024 19:05 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

1/Open LLM evals often face data contamination concerns. Private curators (like ScaleAI) have addressed this with private + expert evaluations.

We argue that this shift poses new risks including financial incentives & eval bias.
w/ @hbxnov.bsky.social

πŸ“: pratyushmaini.github.io/blog/2024/ri... 🧡

27.11.2024 19:05 πŸ‘ 6 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0
Preview
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pret...

Medically adapted foundation models (think Med-*) turn out to be more hot air than hot stuff. Correcting for fatal flaws in evaluation, the current crop are no better on balance than generic foundation models, even on the very tasks for which benefits are claimed.
arxiv.org/abs/2411.04118

26.11.2024 18:12 πŸ‘ 260 πŸ” 57 πŸ’¬ 8 πŸ“Œ 13

Temporally shifted data splits in membership inference can be misleading ⚠️ Be cautious when interpreting these benchmarks!

26.11.2024 18:17 πŸ‘ 2 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
Reassessing EMNLP 2024’s Best Paper: Does Divergence-Based Calibration for Membership Inference Attacks Hold Up? | Anshuman Suri TL;DR: No. A critical analysis of the EMNLP Best Paper proposing a divergence-based calibration for Membership Inference Attacks (MIAs). We explore its experimental shortcomings, ...

6/6 Please read the blog we wrote in order to avoid a byte-sized criticism of someone's hard work: www.anshumansuri.com/blog/2024/ca...

If you work on MIAs for LLMs, repeat after me: Temporally shifted benchmarks πŸ‘ do πŸ‘ not test membership.

26.11.2024 17:59 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
From the MachineLearning community on Reddit: [D] ICML 2022 Outstanding Paper Awards πŸ”₯

5/6 This isn’t just a one-off issue with awards in ML. We are repeatedly seeing this concerning trend. It misguides researchers, misrepresents progress & harms trust in our field. Remember the ICML awards fiasco from a few years ago? www.reddit.com/r/MachineLea...

26.11.2024 17:59 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

4/6 We re-implemented the method and tested it on corrected setups, finding results suggestive of a temporal shift via both false positives & false negatives.

Even more unfortunate: this paper cites Duan et al. (so the authors are aware of the flaws in the setup), yet it creates a new temporally shifted MIA benchmark.

26.11.2024 17:59 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Preview
LLM Dataset Inference: Did you train on my dataset? The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent wor...

3/6 This problem is already described in works of
Duan et al: arxiv.org/abs/2402.07841
Dataset Inference: arxiv.org/abs/2406.06443
Blind MIAs: arxiv.org/abs/2406.16201 (@floriantramer.bsky.social)
Meeus et al: arxiv.org/pdf/2406.17975

and others...

26.11.2024 17:59 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

2/6 One of the Best Paper Awards at EMNLP went to a paper claiming successful MIAs for LLMs.

Unfortunately, the benchmarks studied are all "temporally shifted". At this point, we know very well that these benchmarks give a false sense of membership success by detecting distributional differences.

26.11.2024 17:59 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
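The core failure mode of a temporally shifted split can be shown in a few lines: a "blind" attacker that never queries the target model can separate members from non-members using only a date-correlated surface feature. This toy benchmark (years standing in for any shifted feature like topics or named entities) is illustrative, not any paper's actual data:

```python
import random

random.seed(0)

# Toy temporally shifted "MIA benchmark": members are pre-cutoff documents,
# non-members are post-cutoff documents.
members = [f"report from {random.randint(2015, 2022)}" for _ in range(1000)]
nonmembers = [f"report from {random.randint(2023, 2024)}" for _ in range(1000)]

def blind_guess(doc: str) -> bool:
    """'Member' prediction with zero model access: just read the year."""
    year = int(doc.split()[-1])
    return year <= 2022

correct = sum(blind_guess(d) for d in members) + \
          sum(not blind_guess(d) for d in nonmembers)
accuracy = correct / 2000
print(accuracy)  # → 1.0, despite never touching the model
```

Any MIA evaluated on such a split inherits this confound: high "attack accuracy" may reflect the distribution shift, not memorization.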
Post image

1/6 A lot of us are grappling with peer review these days, but its worst manifestation is when prestigious conference awards overlook critical flaws.

Case in point: #EMNLP2024 ’s Best Paper Award.

@iamgroot42.bsky.social and I wrote a blog on what went wrong: www.anshumansuri.com/blog/2024/ca... 🧡

26.11.2024 17:59 πŸ‘ 7 πŸ” 0 πŸ’¬ 1 πŸ“Œ 1

5/5 Check out @leavittron.bsky.social's detailed bsky thread below:
bsky.app/profile/leav...

And join us (@arimorcos.bsky.social
@agcrnz.bsky.social @alvin-d.bsky.social and many more who shaped this work)!

We are only getting started: jobs.ashbyhq.com/DatologyAI

25.11.2024 18:43 πŸ‘ 6 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

4/5 This was no small feat.
A small team, punching far above its weight, took on giants in an extremely competitive space and delivered kick-ass results. Huge shoutout to my amazing teammates, especially Jack Urbanek & @leavittron.bsky.social β€”absolute legends. πŸ™Œ
Let’s keep pushing πŸ‘Š

25.11.2024 18:43 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Preview
Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset Our data curation pipeline to obtain substantial improvements in LLM quality, training speed, and inference efficiency.

3/5 How did we do it?
🎯 Carefully designed quality filters.
πŸ” Deep understanding of synthetic data.
πŸ“ Analyzing geometric properties of unsupervised data.
πŸ‘€ Constantly looking at data!
It’s all in our deep dive: tinyurl.com/best-llm-data

25.11.2024 18:43 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
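The "quality filters" step can be sketched as a rule-based document scorer in the spirit of common pretraining-data pipelines. Everything here is hypothetical (thresholds and rules are illustrative, not DatologyAI's actual recipe):

```python
# A minimal, hypothetical text-quality filter: reject documents that are
# too short, symbol-heavy, or highly repetitive.
def passes_quality_filter(doc: str) -> bool:
    words = doc.split()
    if len(words) < 5:                      # too short to be useful
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    if alpha_ratio < 0.6:                   # mostly symbols/numbers/URLs
        return False
    if len(set(words)) / len(words) < 0.3:  # highly repetitive text
        return False
    return True

docs = [
    "Buy now!!! $$$ >>> click http://x 123 456",
    "The curation pipeline removes low quality and repetitive documents",
    "spam spam spam spam spam spam spam spam spam spam",
]
print([passes_quality_filter(d) for d in docs])  # → [False, True, False]
```

Real pipelines layer many such signals (plus model-based scoring and dedup), but each individual rule tends to be this simple.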
Post image

2/5 πŸ₯ResultsπŸ₯ We smashed past results, beating both DCLM and FW-Edu by significant margins. πŸš€
Models trained on our curated data:
β€’ scored 4.4% better than DCLM
β€’ trained 2x faster than on FW-Edu
β€’ our 1.3B model outperforms 2.7B models trained on DCLM & FW-Edu

25.11.2024 18:43 πŸ‘ 4 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we’ve been working on!

Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.

Blog: πŸ‘‰ tinyurl.com/best-llm-data 🧡

25.11.2024 18:43 πŸ‘ 15 πŸ” 4 πŸ’¬ 1 πŸ“Œ 0
Post image
22.11.2024 04:42 πŸ‘ 7 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

In my experience, they think for much longer (and make more errors) on questions that have a visual/geometric solution.

21.11.2024 17:43 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

my newfound guilty pleasure is watching the new reasoning models struggle by think-maxxing them with questions from JEE Advanced

21.11.2024 10:05 πŸ‘ 6 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

context from X: One of my dreams when I started my PhD was to teach my own course. I am very excited that I'm getting a chance to create & teach a new "gamified" course at CMU this Fall. 10-799: Data Privacy, Memorization & Copyright in GenAI starts tomorrow!
pratyushmaini.github.io/cmu-10-799

19.11.2024 09:38 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

pretty excited about tomorrow's class. we will know the winner of our first red-blue team pokemon unlearning challenge. 620 more battles to go βš”οΈ

19.11.2024 09:38 πŸ‘ 4 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0