I have been thinking about data privacy, data curation for video models, finetuning vs. pretraining, how alignment data interacts with LLM safety, and its relation to unlearning. Also, very curious to hear about some of the most exciting problems folks in India are working on!
10.12.2024 22:59 · likes 1 · reposts 0 · replies 0 · quotes 0
I'll also be spending time at the @datologyai.com booth to talk about how we curated our way to the best LLM training dataset! Please DM if you would like to chat. The best part about being a researcher is sharing the excitement of what we have been working on with each other.
10.12.2024 22:59 · likes 1 · reposts 0 · replies 1 · quotes 0
Came to #NeurIPS2024 for the research news, but staying for these incredible views. I am presenting some recent works that (I think) significantly advance the discourse on LLM memorization & training data detection, plus a study on hallucinations x model collapse in diffusion models.
10.12.2024 22:59 · likes 8 · reposts 0 · replies 1 · quotes 0
if you're a PhD student at CMU doing AI/ML, lmk if you want to be added to this starter pack.
(I don't belong in this list, but I don't know how to remove myself from this pack)
go.bsky.app/9APVxQQ
03.12.2024 18:27 · likes 14 · reposts 3 · replies 3 · quotes 0
New Paper: Active Data Curation Effectively Distills Multimodal Models
arxiv.org/abs/2411.18674
Smol models are all the rage these days & knowledge distillation (KD) is key for model compression!
We show how data curation can act as an effective form of distillation, yielding SoTA FLOP-efficient {C/Sig}LIPs!!
🧵👇
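For intuition, here is a minimal sketch of learnability-based batch selection, the flavor of active curation this line of work builds on: score candidates with a strong pretrained reference model and keep the examples the student still gets wrong but the reference handles well. This is my own illustrative sketch, not the paper's actual API; `per_example_loss` and `keep_frac` are hypothetical placeholders.

```python
import torch

def learnability_scores(student, reference, batch):
    # Hypothetical hook: per-example losses for a batch, shape [B].
    with torch.no_grad():
        s_loss = student.per_example_loss(batch)
        r_loss = reference.per_example_loss(batch)
    # High score = hard for the student, easy for the reference;
    # training on such examples implicitly distills the reference.
    return s_loss - r_loss

def select_subbatch(student, reference, super_batch, keep_frac=0.2):
    # Score a large candidate batch (assumed to be a tensor) and
    # keep only the top fraction by learnability.
    scores = learnability_scores(student, reference, super_batch)
    k = max(1, int(keep_frac * scores.numel()))
    top = torch.topk(scores, k).indices
    return super_batch[top]
```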
02.12.2024 17:58 · likes 23 · reposts 6 · replies 1 · quotes 2
How to drive your research forward?
“I tested the idea we discussed last time. Here are some results. It does not work.” (… awkward silence)
Such conversations happen so many times in meetings with students. How do we move forward?
You need …
01.12.2024 22:09 · likes 90 · reposts 18 · replies 1 · quotes 1
4/We ended up simulating the bias for a company that "acts in good faith", & found that even in such a case, merely sharing an annotator pool (b/w curators and evaluators) can give the company's customers a 44-point Elo boost... massive bragging rights in today's LLM landscape.
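For scale: inverting the Elo expected-score formula E = 1/(1 + 10^(-d/400)) shows that a 44-point gap corresponds to a ~56% head-to-head win rate. Below is a minimal sketch of that arithmetic, with an assumed `style_pref` standing in for the shared-annotator style bias; it is illustrative only, not the simulation from the post.

```python
import math
import random

def elo_gap(win_rate):
    # Invert E = 1 / (1 + 10**(-d/400))  =>  d = 400 * log10(E / (1 - E)).
    return 400 * math.log10(win_rate / (1 - win_rate))

def simulate(n_battles=10_000, style_pref=0.563, seed=0):
    # Two equally capable models; shared annotators prefer the
    # curator-aligned style with probability `style_pref` (assumed).
    rng = random.Random(seed)
    wins = sum(rng.random() < style_pref for _ in range(n_battles))
    return elo_gap(wins / n_battles)

print(f"Elo boost from annotator overlap alone: ~{simulate():.0f}")  # ~44
```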
27.11.2024 19:05 · likes 2 · reposts 1 · replies 1 · quotes 0
3/(Risk 2): The mere commonality in infra b/w data curators & evaluators can cause significant eval bias, even when they do not have ill-founded financial motives.
"Common infra" includes question templates, topics, styles, annotators, etc.
> with common annotators being the least-privileged form of shared access.
27.11.2024 19:05 · likes 1 · reposts 0 · replies 1 · quotes 0
2/Taking a closer look at SEAL: ScaleAI specializes in data curation for LLM trainers and has now begun establishing its own private evaluations. Two major concerns:
(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.
27.11.2024 19:05 · likes 1 · reposts 0 · replies 1 · quotes 0
1/Open LLM evals often face data contamination concerns. Private curators (like ScaleAI) have addressed this with private + expert evaluations.
We argue that this shift poses new risks including financial incentives & eval bias.
w/ @hbxnov.bsky.social
pratyushmaini.github.io/blog/2024/ri... 🧵
27.11.2024 19:05 · likes 6 · reposts 2 · replies 1 · quotes 0
Temporally shifted data splits in membership inference can be misleading ⚠️ Be cautious when interpreting these benchmarks!
26.11.2024 18:17 · likes 2 · reposts 1 · replies 0 · quotes 0
From the MachineLearning community on Reddit: [D] ICML 2022 Outstanding Paper Awards 🔥
5/6 This isn't just a one-off issue with awards in ML. We are repeatedly seeing this concerning trend. It misguides researchers, misrepresents progress & harms trust in our field. Remember the ICML awards fiasco from a few years ago? www.reddit.com/r/MachineLea...
26.11.2024 17:59 · likes 1 · reposts 0 · replies 1 · quotes 0
4/6 We re-implemented the method and tested it on corrected setups, and found results suggestive of a temporal shift, via both false positives and false negatives.
Even more unfortunately, this paper cites Duan et al. (so the authors are aware of the flaws in the setup), yet creates a new temporally shifted MIA benchmark.
26.11.2024 17:59 · likes 1 · reposts 0 · replies 1 · quotes 0
2/6 One of the Best Paper Awards at EMNLP went to a paper claiming successful MIAs for LLMs.
Unfortunately, the benchmarks studied are all "temporally shifted". At this point, we know very well that such benchmarks give a false sense of membership success by detecting distributional differences, not membership.
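A toy illustration of the confound (all numbers assumed; no real benchmark or model involved): when non-members come from a later time period, any time-correlated text statistic separates the two sets, so a "membership attack" that never queries the model still gets a high AUC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Members are pre-cutoff texts; non-members are post-cutoff texts.
# Temporal drift shifts a generic statistic (e.g., the rate of
# post-cutoff vocabulary) upward for the newer documents.
member_stat     = rng.normal(0.0, 1.0, 2000)  # older, in training set
non_member_stat = rng.normal(0.8, 1.0, 2000)  # newer, never seen

def auc(pos_scores, neg_scores):
    # AUC = P(random positive outranks random negative).
    diff = pos_scores[:, None] - neg_scores[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# "Attack": call a text a member whenever its drift statistic is low.
print(f"model-free AUC: {auc(-member_stat, -non_member_stat):.2f}")  # ~0.71
```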
26.11.2024 17:59 · likes 2 · reposts 0 · replies 1 · quotes 0
1/6 A lot of us are grappling with peer review these days, but its worst manifestation is when prestigious conference awards overlook critical flaws.
Case in point: #EMNLP2024's Best Paper Award.
@iamgroot42.bsky.social and I wrote a blog on what went wrong: www.anshumansuri.com/blog/2024/ca... 🧵
26.11.2024 17:59 · likes 7 · reposts 0 · replies 1 · quotes 1
5/5 Check out @leavittron.bsky.social's detailed bsky thread below:
bsky.app/profile/leav...
And join us (@arimorcos.bsky.social, @agcrnz.bsky.social, @alvin-d.bsky.social, and many more who shaped this work)!
We are only getting started: jobs.ashbyhq.com/DatologyAI
25.11.2024 18:43 · likes 6 · reposts 0 · replies 0 · quotes 0
4/5 This was no small feat.
A small team, punching far above its weight, took on giants in an extremely competitive space and delivered kick-ass results. Huge shoutout to my amazing teammates, especially Jack Urbanek & @leavittron.bsky.social: absolute legends.
Let's keep pushing.
25.11.2024 18:43 · likes 3 · reposts 0 · replies 1 · quotes 0
Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset
Our data curation pipeline delivers substantial improvements in LLM quality, training speed, and inference efficiency.
3/5 How did we do it?
🎯 Carefully designed quality filters.
• Deep understanding of synthetic data.
• Analyzing geometric properties of unsupervised data (see the sketch below).
👀 Constantly looking at data!
It's all in our deep dive: tinyurl.com/best-llm-data
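As one purely illustrative example of acting on embedding geometry (in the spirit of SemDeDup-style curation, not DatologyAI's actual pipeline): cluster document embeddings, then drop items that sit too close to an already-kept neighbor. `n_clusters` and `sim_threshold` are assumed values.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings, n_clusters=100, sim_threshold=0.95):
    # Normalize rows so dot products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init="auto",
                    random_state=0).fit_predict(X)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept = []  # indices already accepted within this cluster
        for i in idx:
            # Keep i only if it is not a near-duplicate of a kept item.
            if all(X[i] @ X[j] < sim_threshold for j in kept):
                kept.append(i)
        keep.extend(kept)
    return np.sort(np.array(keep))  # indices of documents to retain
```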
25.11.2024 18:43 · likes 2 · reposts 0 · replies 1 · quotes 0
2/5 🔥Results🔥 We smashed past results, beating both DCLM and FW-Edu by significant margins.
Our models trained on curated data achieved:
• 4.4% better than DCLM
• 2x faster training than FW-Edu
• a 1.3B model that outperforms 2.7B models trained on DCLM & FW-Edu
25.11.2024 18:43 · likes 4 · reposts 0 · replies 1 · quotes 0
1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we've been working on!
Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.
Blog: tinyurl.com/best-llm-data 🧵
25.11.2024 18:43 · likes 15 · reposts 4 · replies 1 · quotes 0
22.11.2024 04:42 · likes 7 · reposts 1 · replies 0 · quotes 0
From my experience, they think for particularly long (and make more errors) on questions that have a visual/geometric solution.
21.11.2024 17:43 · likes 0 · reposts 0 · replies 1 · quotes 0
my newfound guilty pleasure is watching the new reasoning models struggle by think-maxxing them with questions from JEE Advanced
21.11.2024 10:05 · likes 6 · reposts 0 · replies 1 · quotes 0
context from X: One of my dreams when I started my PhD was to teach my own course. I am very excited that I'm getting a chance to create & teach a new "gamified" course at CMU this Fall. 10-799: Data Privacy, Memorization & Copyright in GenAI starts tomorrow!
pratyushmaini.github.io/cmu-10-799
19.11.2024 09:38 · likes 1 · reposts 0 · replies 0 · quotes 0
pretty excited about tomorrow's class. we will know the winner of our first red-blue team pokemon unlearning challenge. 620 more battles to go ⚔️
19.11.2024 09:38 · likes 4 · reposts 0 · replies 1 · quotes 0