Synthetic data is all the rage in LLM training, but why does it work? In arxiv.org/abs/2502.08924 we show how to analyze this question through the lens of boosting. Unlike boosting, however, our assumptions on the data and the learning method are inverted.
Instead of a weak learner, we assume access to models that can perfectly model an input distribution, which we call strong learners.
But instead of iid samples, we have access only to weak information about the target distribution, i.e. weak data.
These two assumptions are enough to prove convergence bounds! More generally, this view gives a theoretical framework that unifies elements of existing synthetic data approaches, making it easier to reason about when they might succeed or fail.
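The paper's actual construction is in the arXiv link above; as a loose illustration only (my own toy, not the paper's algorithm), here is what a "strong learner + weak data" loop can look like in the spirit of MWEM-style synthetic data. The strong learner can represent any distribution over a tiny discrete domain exactly, while the only information about the target is noisy answers to a few linear queries (the weak data). All names and parameters here are made up for the sketch:

```python
import numpy as np

# Toy sketch (not the paper's algorithm): a boosting-style loop where a
# "strong learner" can fit any distribution over a small discrete domain,
# but we only see "weak data": noisy answers to a few linear queries
# about the unknown target distribution.

rng = np.random.default_rng(0)
k = 32                                   # domain size
target = rng.dirichlet(np.ones(k))       # unknown target distribution

# Weak data: noisy answers to m random {0,1}-valued linear queries.
m = 10
queries = rng.integers(0, 2, size=(m, k)).astype(float)
answers = queries @ target + rng.normal(0, 0.01, size=m)

# The strong learner is trivial here: we track an explicit distribution
# and can "refit" it perfectly every round.
model = np.full(k, 1.0 / k)              # start from uniform

eta = 0.5                                # multiplicative-weights step size
for t in range(200):
    # Pick the query where the current model errs the most.
    errs = queries @ model - answers
    j = int(np.argmax(np.abs(errs)))
    # Multiplicative-weights correction toward the weak measurement.
    model *= np.exp(-eta * np.sign(errs[j]) * queries[j])
    model /= model.sum()

print("max query error:", np.max(np.abs(queries @ model - answers)))
print("TV distance to target:", 0.5 * np.abs(model - target).sum())
```

The point of the toy: the model class is fully expressive (strong), and convergence is driven entirely by how informative the weak measurements are, which mirrors the inverted-boosting framing above.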
Shameless plug on DP (differentially private) synthetic data: research.google/blog/protect...