Microsoft Research NYC is hiring!
We're hiring postdocs and senior researchers in AI/ML broadly, and in specific areas like test-time scaling and science of DL. Postdoc applications due Oct 22, 2025. Senior researcher applications considered on a rolling basis.
Links to apply: aka.ms/msrnyc-jobs
18.09.2025 14:37
Training with more data = better LLMs, right?
False! Scaling language models by adding more pre-training data can decrease performance after post-training!
Introducing "catastrophic overtraining."
arxiv.org/abs/2503.19206
1/10
26.03.2025 18:35
1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to dive in:
04.03.2025 20:59
super happy about this preprint! we can *finally* perform efficient exploration and find near-optimal stationary policies in infinite-horizon linear MDPs, and even use it for imitation learning :) working with @neu-rips.bsky.social and @lviano.bsky.social on this was so much fun!!
20.02.2025 17:45
What are the minimal supervised learning primitives required to perform RL efficiently?
New paper led by my amazing intern Dhruv Rohatgi:
Necessary and Sufficient Oracles: Toward a Computational Taxonomy for Reinforcement Learning
arxiv.org/abs/2502.08632
1/
20.02.2025 23:39
Models can self-improve by knowing when they were wrong. But when can they do it?
Across LLM families, tasks, and mechanisms: this ability scales with pretraining, prefers CoT and non-QA tasks, and more in the thread.
alphaxiv.org/abs/2412.02674
@yus167.bsky.social @shamkakade.bsky.social
#NLP #ML
13.12.2024 23:55
On Saturday I will present our LLM self-improvement paper in the workshop on Mathematics of Modern Machine Learning (M3L) and the workshop on Statistical Foundations of LLMs and Foundation Models (SFLLM).
bsky.app/profile/yus1...
09.12.2024 19:48
NeurIPS 2024 poster: The Importance of Online Data: Understanding Preference Fine-tuning via Coverage
I will present two papers at #NeurIPS2024!
Happy to meet old and new friends and talk about all aspects of RL: data, environment structure, and reward!
In the Wednesday 11am-2pm poster session, I will present HyPO -- the best of both worlds of offline and online RLHF: neurips.cc/virtual/2024...
09.12.2024 19:48
We also dive deep into the similarities and differences between verification mechanisms. We observe consistency, distinction, and ensembling properties of the verification methods (see the summary image). (8/9)
06.12.2024 18:02
In iterative self-improvement, we observe the gap shrink to 0 within a few iterations, consistent with many previous findings. We find that one cause of this saturation is the degradation of the "effective diversity" of the generations due to the imperfect verifier. (7/9)
06.12.2024 18:02
However, self-improvement is not possible on all tasks. We do not observe a significant self-improvement signal on QA tasks like Natural Questions. Also, not all models can self-improve on Sudoku, a canonical example of "verification is easier than generation." (6/9)
06.12.2024 18:02
Our first major result is an observational scaling law: with certain verification methods, the relative gap increases monotonically (almost linearly) with the log of pretraining FLOPs on tasks like GSM8K and MATH. (5/9)
06.12.2024 18:02
We propose measuring the performance difference between the reweighted and original responses (step 2 minus step 1) -- the "generation-verification gap." We also study the relative gap -- the gap normalized by the error rate. Intuitively, improvement is harder when the model makes fewer mistakes. (4/9)
06.12.2024 18:02
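The gap definitions in (4/9) above can be sketched numerically. A minimal toy in Python, assuming binary correctness labels; the function names and numbers are illustrative, not from the paper's code:

```python
# Toy sketch of the generation-verification gap: verifier-reweighted
# accuracy (step 2) minus original generation accuracy (step 1).

def accuracy(responses):
    """Fraction of responses marked correct (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)

def gv_gap(original, reweighted):
    """Absolute gap: step-2 accuracy minus step-1 accuracy."""
    return accuracy(reweighted) - accuracy(original)

def relative_gap(original, reweighted):
    """Gap normalized by the original error rate: improvement is harder
    when the model already makes few mistakes."""
    err = 1.0 - accuracy(original)
    return gv_gap(original, reweighted) / err if err > 0 else 0.0

orig = [1, 0, 0, 1, 0, 1, 0, 0]   # 3/8 correct before reweighting
rew  = [1, 1, 0, 1, 0, 1, 1, 0]   # 5/8 correct after verifier reweighting

print(gv_gap(orig, rew))        # 0.25
print(relative_gap(orig, rew))  # 0.25 / 0.625 = 0.4
```

The relative gap divides by the error rate, so the same absolute improvement counts for more when the base model is already strong.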
While previous works measure self-improvement as the performance difference between the models (step 3 minus step 1), we found that step 3 (distillation) introduces confounders (for example, the new model may simply get better at following certain formats). (3/9)
06.12.2024 18:02
We study self-improvement as the following process:
1. Model generates many candidate responses.
2. Model filters/reweights responses based on its verifications.
3. Distill the reweighted responses into a new model.
(2/9)
06.12.2024 18:02
LLM self-improvement has critical implications in synthetic data, post-training and test-time inference. To understand LLMs' true capability of self-improvement, we perform large-scale experiments with multiple families of LLMs, tasks and mechanisms. Here is what we found: (1/9)
06.12.2024 18:02
Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
https://arxiv.org/abs/2412.02674
04.12.2024 09:09
I think the main difference in terms of interpolation / extrapolation between DPO and RLHF is that the former only guarantees closeness to the reference policy on the training data, while RLHF usually tacks on an on-policy KL penalty. We explored this point in arxiv.org/abs/2406.01462.
22.11.2024 15:38
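The contrast above can be made concrete with the two per-sample objectives. A minimal sketch, assuming scalar log-probabilities and the standard DPO/KL-penalized RLHF formulas; all names and numbers are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid(beta * margin), where the margin uses
    policy-vs-reference log-ratios on the *offline training pair* only."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def kl_penalized_reward(reward, logp, ref_logp, beta=0.1):
    """RLHF-style per-sample objective: reward minus a KL estimate on a
    sample drawn from the *current policy* (on-policy)."""
    return reward - beta * (logp - ref_logp)

# DPO only sees the chosen/rejected pair from the dataset...
print(round(dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1), 4))  # 0.6444
# ...while the KL term is averaged over on-policy samples, so it
# constrains the policy wherever it actually puts probability mass.
print(kl_penalized_reward(1.0, -1.0, -1.5, beta=0.1))
```

This is why DPO only guarantees closeness to the reference on the training data: off that support, nothing in its loss penalizes drift.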
(1/n) How can we speed up the serial runtime of long pre-training runs? Enter the Critical Batch Size (CBS): the tipping point where the gains of data parallelism balance against diminishing efficiency. Doubling the batch size halves the number of optimization steps -- until we hit the CBS, beyond which returns diminish.
22.11.2024 20:19
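The CBS picture above can be illustrated with a deliberately simplified model: below the CBS, doubling the batch halves the steps; past it, extra parallelism stops helping. The numbers and the hard cutoff are made up for illustration (real scaling curves bend smoothly):

```python
def steps_to_target(batch_size, total_examples=1_000_000, cbs=4096):
    """Optimizer steps needed for a fixed data budget, with data-parallel
    gains capped at the critical batch size."""
    effective = min(batch_size, cbs)  # parallelism stops paying off past CBS
    return total_examples / effective

for b in [512, 1024, 2048, 4096, 8192]:
    print(b, steps_to_target(b))
```

Below 4096 each doubling halves the serial step count; at 8192 the step count is unchanged from 4096, so the extra hardware buys nothing.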
I created a starter pack for people who are or have been affiliated with the Machine Learning Department at CMU. Let me know if I missed someone!
go.bsky.app/QLTVEph
#AcademicSky
18.11.2024 15:46
Ojash Neopane, Aaditya Ramdas, Aarti Singh
Logarithmic Neyman Regret for Adaptive Estimation of the Average Treatment Effect
https://arxiv.org/abs/2411.14341
22.11.2024 05:01
Intro
I am a final-year PhD student from CMU Robotics. I work on humanoid control, perception, and behavior in both simulation and real life, using mostly RL:
PHC: zhengyiluo.com/PHC
PULSE: zhengyiluo.com/PULSE
Omnigrasp: zhengyiluo.com/Omnigrasp
OmniH2O: omni.human2humanoid.com
19.11.2024 20:34
Hi Bsky people! I'm a PhD candidate in Machine Learning at Carnegie Mellon University.
My research focuses on interactive AI, involving:
- reinforcement learning,
- foundation models, and
- human-centered AI.
Also a founding co-organizer of the MineRL competitions. Follow me for ML updates!
18.11.2024 15:05