🚨 A Researcher's Guide to Empirical Risk Minimization
I put together a guide on regret theory for empirical risk minimization (ERM) as I understand it.
The goal was to compile results and proof techniques I've found useful in my own work. I hope others find it useful more broadly.
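For anyone new to the area, the starting point in most treatments is the basic excess-risk decomposition, stated here informally. With empirical risk \hat R_n(f) = \tfrac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i), population risk R(f) = \mathbb{E}[\ell(f(X), Y)], \hat f = \arg\min_{f \in \mathcal F} \hat R_n(f), and f^\star = \arg\min_{f \in \mathcal F} R(f):

R(\hat f) - R(f^\star)
  = \big[R(\hat f) - \hat R_n(\hat f)\big] + \big[\hat R_n(\hat f) - \hat R_n(f^\star)\big] + \big[\hat R_n(f^\star) - R(f^\star)\big]
  \le 2 \sup_{f \in \mathcal F} \big|\hat R_n(f) - R(f)\big|,

since the middle term is \le 0 by the definition of \hat f. Controlling that supremum is where the empirical-process machinery comes in.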
26.02.2026 16:40
👍 8
🔁 3
💬 0
📌 0
Fig 4 from Zhang, Lee, Liu "Statistical Learning Theory in Lean 4: Empirical Processes from Scratch"
The dependency graph of the formalizations. The diagram shows the proof of Dudley's entropy integral and of Gaussian Lipschitz concentration, each with their preceding lemmas, feeding into a Gaussian complexity inequality and an error bound at the critical radius, which are then used to prove sharp minimax error rates for linear regression.
Never have I felt more like my job will soon be taken by AI. Statistical learning theory in Lean: concentration inequalities, Dudley's entropy integral, and local Gaussian complexity bounds.
30,000 lines of code and over 1,000 lemmas, formalizing results from Wainwright and Boucheron et al. arxiv.org/abs/2602.02285
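For context, the headline result in that dependency graph is Dudley's entropy integral bound: for a zero-mean process (X_t)_{t \in T} that is sub-Gaussian with respect to a metric d (for a Gaussian process, d(s,t) = \sqrt{\mathbb{E}(X_s - X_t)^2}),

\mathbb{E}\Big[\sup_{t \in T} X_t\Big] \;\le\; C \int_0^\infty \sqrt{\log N(T, d, \varepsilon)}\, d\varepsilon,

where N(T, d, \varepsilon) is the \varepsilon-covering number of T and C is a universal constant whose value depends on the formulation. Localizing this bound at the critical radius is the standard route to the sharp regression rates mentioned in the figure.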
03.02.2026 19:20
👍 32
🔁 5
💬 0
📌 4
By that metric, the AlphaFold paper should be about (*checks math*) 93M words long. 😉
20.01.2026 16:02
👍 3
🔁 0
💬 0
📌 0
Gemini
04.12.2025 21:54
👍 4
🔁 0
💬 0
📌 1
I'm excited to dig into this new work on numerically approximating efficient influence functions. The main idea seems to be to use a Fourier-type approximation, rather than the kernel-smoother approximation used in earlier approaches.
openreview.net/pdf/cfeab45d...
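For readers who haven't met the object being approximated: the efficient influence function (EIF) \varphi_P of a pathwise differentiable functional \psi at P satisfies

\frac{d}{d\varepsilon} \psi(P_\varepsilon)\Big|_{\varepsilon=0} = \mathbb{E}_P[\varphi_P(O)\, s(O)], \qquad \mathbb{E}_P[\varphi_P(O)] = 0,

for every smooth one-dimensional submodel \{P_\varepsilon\} through P with score s (with \varphi_P lying in the model's tangent space, which is what makes it efficient). It enters estimators through the one-step correction \hat\psi = \psi(\hat P) + \tfrac{1}{n}\sum_{i=1}^n \varphi_{\hat P}(O_i), so a good numerical approximation to \varphi_{\hat P}, here a Fourier-type one rather than a kernel smoother as I read it, is what makes the procedure automatic.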
07.11.2025 15:32
👍 6
🔁 0
💬 0
📌 0
Our method can take existing generative models and use them to produce counterfactual images, text, etc.
From a technical perspective, our approach is doubly robust and can be wrapped around state-of-the-art approaches like diffusion models, flow matching, and autoregressive language models.
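To make "wrapped around" concrete: the simplest (not yet doubly robust) version of the idea is to reweight the per-example generative loss so the factual training subset stands in for the counterfactual population. Below is a minimal sketch of that inverse-propensity-weighted denoising loss, with made-up names and a toy noise schedule; the paper's doubly robust construction adds an outcome-regression correction on top and is not what this code implements.

import torch

def ipw_denoising_loss(model, x0, propensity, t_frac, noise):
    # x0: clean examples from the factual subset, shape (batch, dim)
    # propensity: cross-fitted estimates of P(factual condition | covariates), shape (batch,)
    # t_frac: diffusion times in [0, 1], shape (batch,); noise: same shape as x0
    alpha_bar = torch.cos(t_frac * torch.pi / 2) ** 2                     # toy cosine schedule
    x_t = alpha_bar.sqrt()[:, None] * x0 + (1 - alpha_bar).sqrt()[:, None] * noise
    pred = model(x_t, t_frac)                                             # model predicts the added noise
    per_example = ((pred - noise) ** 2).mean(dim=1)                       # standard denoising MSE
    w = 1.0 / propensity.clamp(min=1e-3)                                  # inverse-propensity weights
    w = w / w.mean()                                                      # self-normalize for stability
    return (w * per_example).mean()

Swapping the denoising MSE for a flow-matching or language-model loss is what "wrapped around" means in practice.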
24.09.2025 20:42
👍 3
🔁 0
💬 0
📌 0
Title page for paper:
DoubleGen: Debiased Generative Modeling of Counterfactuals
arXiv:2509.16842 (stat)
Alex Luedtke, Kenji Fukumizu
Selected attributes that are more common in smiling (n = 78 080) than in non-smiling (n = 84 690) CelebA faces. If a model is trained only on the smiling subset, it tends to over-produce these attributes instead of showing how the full population would look if everyone smiled.
Table:
             Lipstick  Makeup  Female*  Earrings  No-beard  Blonde
Smiling        56 %     47 %    65 %     26 %      88 %      18 %
Not smiling    38 %     30 %    52 %     12 %      79 %      12 %
Overall        47 %     38 %    58 %     19 %      83 %      15 %
Counterfactual smiling celebrities generated by a traditional diffusion model trained on only smiling faces (top) and a DoubleGen diffusion model (bottom). Columns contain coupled samples, with the random seed set to the same value before generation. The stars mark the most qualitatively different pairs.
What’s visible: two horizontal rows, each showing twelve AI-generated smiling portraits.
Starred columns highlight the biggest shifts: in those pairs, DoubleGen produces faces with traits under-represented among smiling faces in the original data. Non-starred columns look nearly identical between the two rows.
New paper on generative modeling of counterfactual distributions! We give a way to answer "what if" questions with generative models.
For example: what would faces look like if they were all smiling?
arxiv.org/abs/2509.16842
24.09.2025 20:42
👍 8
🔁 1
💬 1
📌 0
Same - me since I was 4. CGM is fantastic.
09.09.2025 13:35
👍 2
🔁 0
💬 1
📌 0
Carlos Cinelli, Avi Feller, Guido Imbens, Edward Kennedy, Sara Magliacane, Jose Zubizarreta
Challenges in Statistics: A Dozen Challenges in Causality and Causal Inference
https://arxiv.org/abs/2508.17099
26.08.2025 05:56
👍 10
🔁 4
💬 0
📌 0
I want to advertise some relatively recent work which I really like, and have been fortunate to play a small role in.
The paper is titled "A New Proof of Sub-Gaussian Norm Concentration Inequality" (arxiv.org/abs/2503.14347), led by Zishun Liu and Yongxin Chen at Georgia Tech.
19.08.2025 08:28
👍 36
🔁 9
💬 1
📌 0
Neat AI product for improving technical writing.
Tried it on a 50 page draft of a causal ML paper. Of its top 10 comments, 4 concerned minor technical issues I'd missed (notation error, misapplication of definition, etc.). In my experience, vanilla chatbots wouldn't have caught these.
24.07.2025 05:48
👍 5
🔁 1
💬 0
📌 0
Starting to look like I might not be able to work at Harvard anymore due to recent funding cuts. If you know of any open statistical consulting positions that support remote work or are NYC-based, please reach out! 😅
04.06.2025 19:02
👍 152
🔁 96
💬 11
📌 7
I've advised 15 PhD students—10 were international students. All graduates continue advancing U.S. excellence in research and education. Cutting off this pipeline of talent would be shortsighted.
23.05.2025 03:36
👍 8
🔁 2
💬 0
📌 0
I'm a current Harvard graduate student and I found out today that I had my NSF GRFP terminated without notification. I was awarded this individual research fellowship before even choosing Harvard as my graduate school.
22.05.2025 21:38
👍 897
🔁 313
💬 45
📌 12
Had a great time presenting at #ACIC on doubly robust inference via calibration
Calibrating nuisance estimates in DML protects against model misspecification and slow convergence.
It takes just one line of code.
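I won't paste the package here, but the shape of the idea fits in a few lines: isotonically calibrate the cross-fitted propensity estimates against the observed treatment before they enter the AIPW score. Toy variable names, illustration only, not our released code:

import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrated_aipw(y, a, prop_hat, mu1_hat, mu0_hat):
    # Roughly the kind of one-line calibration step referred to above (illustration):
    # calibrate cross-fitted propensity scores against the observed treatment, which
    # protects the downstream DML/AIPW step against misspecified or slowly
    # converging propensity estimates.
    prop_cal = IsotonicRegression(out_of_bounds="clip").fit(prop_hat, a).predict(prop_hat)
    prop_cal = np.clip(prop_cal, 1e-3, 1 - 1e-3)
    # Standard AIPW (doubly robust) score for the average treatment effect.
    score = (mu1_hat - mu0_hat
             + a * (y - mu1_hat) / prop_cal
             - (1 - a) * (y - mu0_hat) / (1 - prop_cal))
    return score.mean(), score.std(ddof=1) / np.sqrt(len(y))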
19.05.2025 00:02
👍 19
🔁 1
💬 1
📌 2
Thanks for the pointer! We'll check it out
01.05.2025 21:37
👍 0
🔁 0
💬 0
📌 0
Our main insight is that smooth divergences - like the Sinkhorn - behave locally like an MMD, and so it suffices to compress with respect to that criterion. This insight draws from recent works studying distributional limits of Sinkhorn divergences (Goldfeld et al., Gonzalez-Sanz et al.).
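Schematically (paraphrasing rather than quoting those results, and suppressing regularity conditions): for the Sinkhorn divergence S_\varepsilon with entropic regularization \varepsilon > 0, when Q is a small perturbation of P,

S_\varepsilon(P, Q) \;\approx\; \mathrm{MMD}_{k_{P,\varepsilon}}^2(P, Q),

a squared maximum mean discrepancy whose kernel k_{P,\varepsilon} depends on the base measure and on \varepsilon. So keeping \mathrm{MMD}_{k_{P,\varepsilon}}(P, P_{\mathrm{coreset}}) small keeps the Sinkhorn error small, locally, and compression can target the MMD criterion directly.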
30.04.2025 12:59
👍 2
🔁 0
💬 0
📌 0
We build on earlier coreset selection works that compress with respect to maximum mean discrepancy (MMD), including kernel thinning (Dwivedi and @lestermackey.bsky.social) and quadrature (Hayakawa et al.).
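If "compress with respect to MMD" is unfamiliar, the simplest greedy version of the idea looks like this (closer to kernel herding than to kernel thinning, quadrature, or our method; purely illustrative, with a Gaussian kernel):

import numpy as np

def greedy_mmd_coreset(X, m, bandwidth=1.0):
    # Greedily pick m rows of X whose empirical measure stays close, in MMD,
    # to that of the full dataset (herding-style update; illustration only).
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))        # Gaussian kernel matrix
    mean_embed = K.mean(axis=1)                   # kernel mean embedding of the data
    chosen, running = [], np.zeros(n)             # running = sum of K[:, j] over chosen j
    for t in range(m):
        # Pick the point most aligned with the data embedding, penalizing
        # similarity to the points already chosen.
        scores = mean_embed - running / (t + 1)
        if chosen:
            scores[chosen] = -np.inf              # keep the coreset points distinct
        j = int(np.argmax(scores))
        chosen.append(j)
        running += K[:, j]
    return X[chosen]

# e.g. coreset = greedy_mmd_coreset(np.random.randn(2000, 10), m=50)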
30.04.2025 12:59
👍 2
🔁 0
💬 3
📌 0
We pay special attention to the Sinkhorn divergence from optimal transport. Using our method, CO2, a dataset of size n can be compressed to about size log(n) without meaningful Sinkhorn error.
30.04.2025 12:59
👍 3
🔁 0
💬 1
📌 0
The Sinkhorn reconstruction error across dimensions (left) and dataset sizes (right). In the first plot the sample size is fixed at n=25,000; in the second, the dimension is fixed at d=10. The proposed compression method, CO2, outperforms random sampling in all settings considered.
Q-Q plots of the Sinkhorn reconstruction error (left) and l1 error between the label proportions (right) of the compressed data as compared to random samples. The proposed compression method, CO2, outperforms random sampling in all settings considered.
New paper, led by my student Alex Kokot!
We study dataset compression through coreset selection - finding a small, weighted subset of observations that preserves information with respect to some divergence.
arxiv.org/abs/2504.20194
30.04.2025 12:59
👍 11
🔁 1
💬 2
📌 0
The NIH overhead cut doesn't just hurt universities.
It's deadly to the US economy.
The US is a world leader in tech due to the ecosystem that NIH and NSF propel. It drives innovation for tech transfer, creates a highly-skilled sci/tech workforce, and fosters academic/industry crossfertilization.
08.02.2025 02:03
👍 1346
🔁 512
💬 30
📌 20
Agreed. And when misspecified, the MLE is estimating a Kullback-Leibler projection of the true distribution onto the misspecified model (and is consistent for that as n → ∞).
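Concretely, with true distribution P_0 and working model \{p_\theta\}:

\theta^\star = \arg\min_\theta \mathrm{KL}(P_0 \,\|\, P_\theta)
            = \arg\min_\theta \mathbb{E}_{P_0}\big[\log p_0(X) - \log p_\theta(X)\big]
            = \arg\max_\theta \mathbb{E}_{P_0}\big[\log p_\theta(X)\big],

and the MLE maximizes the empirical counterpart \tfrac{1}{n} \sum_i \log p_\theta(X_i), so under the usual regularity conditions \hat\theta_n \to \theta^\star even when p_{\theta^\star} \neq p_0.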
24.01.2025 18:13
👍 3
🔁 0
💬 1
📌 0
Thrilled to share our new paper! We introduce a generalized autoDML framework for smooth functionals in general M-estimation problems, significantly broadening the scope of problems where automatic debiasing can be applied!
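For intuition, here is the finite-dimensional special case of what gets debiased (the paper handles much more general M-estimation problems than this sketch). Let \theta_0 = \arg\min_\theta \mathbb{E}[\ell(W; \theta)], H = \mathbb{E}[\nabla_\theta^2 \ell(W; \theta_0)], and target \psi_0 = m(\theta_0) for a smooth map m. Then

\hat\psi - \psi_0 \;\approx\; -\nabla m(\theta_0)^\top H^{-1} \, \frac{1}{n} \sum_{i=1}^n \nabla_\theta \ell(W_i; \theta_0),

so the influence function is \varphi(W) = -\nabla m(\theta_0)^\top H^{-1} \nabla_\theta \ell(W; \theta_0). The "automatic" part is estimating the Riesz representer \alpha_0 = -H^{-1} \nabla m(\theta_0) directly from its defining moment condition rather than deriving \varphi analytically for each new problem.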
22.01.2025 13:54
👍 19
🔁 7
💬 1
📌 0
Welcome, @danielawitten.bsky.social!
24.11.2024 11:04
👍 5
🔁 0
💬 0
📌 1
👋 In Tokyo this academic year, on sabbatical at the Institute of Statistical Mathematics.
In town and interested in causal ML? Would love to grab coffee and chat.
12.11.2024 10:51
👍 7
🔁 2
💬 0
📌 0
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two.
"The Elements of Differentiable Programming"
link: arxiv.org/abs/2403.14606
Basically: "autodiff - it's everywhere! what is it, and how do you use it?" seems like a good resource for anyone interested in data science, machine learning, "ai," neural nets, etc
#blueskai #stats #mlsky
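If you want the two-minute version of what autodiff actually does before opening the book, forward mode fits in a few lines of plain Python (a toy with only + and *, nothing like a real implementation):

class Dual:
    # Forward-mode autodiff: carry a value and its derivative together.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def grad(f):
    # Differentiate a scalar Python function by seeding dx/dx = 1.
    return lambda x: f(Dual(x, 1.0)).dot

def program(x):
    y = 3 * x * x + 2 * x + 1
    if y.val > 10:          # control flow is fine: the derivative follows the branch taken
        y = y * x
    return y

print(grad(program)(2.0))   # d/dx of (3x^2 + 2x + 1) * x at x = 2, i.e. 9x^2 + 4x + 1 = 45.0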
02.04.2024 00:31
👍 32
🔁 13
💬 0
📌 0