Synthetic data is all the rage in LLM training, but why does it work? In arxiv.org/abs/2502.08924 we show how to analyze this question through the lens of boosting. Unlike boosting, however, our assumptions on the data and the learning method are inverted.
Instead of a weak learner, we assume access to models that can perfectly model an input distribution, which we call strong learners.
But instead of iid samples, we have access only to weak information about the target distribution, i.e. weak data.
These two assumptions are enough to prove convergence bounds! More generally, this view gives a theoretical framework that unifies elements of existing synthetic data approaches, making it easier to reason about when they might succeed or fail.
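The paper's actual construction is in the arXiv link above; as a loose illustration only (my own toy, not the paper's algorithm), here is what a "strong learner + weak data" loop can look like in the spirit of MWEM-style synthetic data. The strong learner can represent any distribution over a tiny discrete domain exactly, while the only information about the target is noisy answers to a few linear queries (the weak data). All names and parameters here are made up for the sketch:

```python
import numpy as np

# Toy sketch (not the paper's algorithm): a boosting-style loop where a
# "strong learner" can fit any distribution over a small discrete domain,
# but we only see "weak data": noisy answers to a few linear queries
# about the unknown target distribution.

rng = np.random.default_rng(0)
k = 32                                   # domain size
target = rng.dirichlet(np.ones(k))       # unknown target distribution

# Weak data: noisy answers to m random {0,1}-valued linear queries.
m = 10
queries = rng.integers(0, 2, size=(m, k)).astype(float)
answers = queries @ target + rng.normal(0, 0.01, size=m)

# The strong learner is trivial here: we track an explicit distribution
# and can "refit" it perfectly every round.
model = np.full(k, 1.0 / k)              # start from uniform

eta = 0.5                                # multiplicative-weights step size
for t in range(200):
    # Pick the query where the current model errs the most.
    errs = queries @ model - answers
    j = int(np.argmax(np.abs(errs)))
    # Multiplicative-weights correction toward the weak measurement.
    model *= np.exp(-eta * np.sign(errs[j]) * queries[j])
    model /= model.sum()

print("max query error:", np.max(np.abs(queries @ model - answers)))
print("TV distance to target:", 0.5 * np.abs(model - target).sum())
```

The point of the toy: the model class is fully expressive (strong), and convergence is driven entirely by how informative the weak measurements are, which mirrors the inverted-boosting framing above.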
Shameless plug on DP (differentially private) synthetic data: research.google/blog/protect...