they should add reaction emojis to openreview
looks similar to Saclay this morning
in my experience, convex is rarely, if ever, used in day-to-day life. most people who aren't mathematicians seem unsure of the difference between concave and convex to begin with. not apples to apples imo, since increasing is used by everyone pretty regularly, with an agreed-upon meaning.
The point is that for some conferences (NeurIPS, ICML) reviews are published for rejected papers but not for withdrawn papers; I thought this might be the case. I see from your reasoning why it cannot be the explanation.
Does the conference use openreview? Maybe they are evading having the bad reviews published by withdrawing?
So when you're doing muon with weight decay to train nanoGPT you're using frank-wolfe to train a frank-wolfe machine
I missed this post but it is pure gold. www.colincornaby.me/2025/08/in-t...
Our results are for any algorithm that fits the stochastic conditional gradient framework, which includes Muon notably but also normalized SGD, sign SGD, and others (e.g., greedy coordinate descent, low-rank stuff).
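A minimal sketch (my own illustration, not code from the paper) of why those methods fit one template: each is the conditional gradient update with a different linear minimization oracle (LMO) over a norm ball. Sign SGD corresponds to the L-infinity ball, normalized SGD to the L2 ball.

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # argmin over the L-inf ball of <g, d>  ->  -radius * sign(g)  (sign SGD direction)
    return -radius * np.sign(g)

def lmo_l2(g, radius=1.0):
    # argmin over the L2 ball of <g, d>  ->  -radius * g / ||g||  (normalized SGD direction)
    return -radius * g / np.linalg.norm(g)

def conditional_gradient_step(x, g, lmo, lr=0.1):
    # generic update: move along the LMO direction for the (stochastic) gradient g
    return x + lr * lmo(g)

x = np.array([1.0, -2.0, 3.0])
g = np.array([0.5, -1.0, 2.0])
print(conditional_gradient_step(x, g, lmo_linf))  # sign-SGD-style step
print(conditional_gradient_step(x, g, lmo_l2))    # normalized-SGD-style step
```

Swapping in the spectral-ball LMO for matrix parameters gives the Muon-style update in the same template.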
Yep, none of this is affecting the loss - these regularizers are being added to the computation of the update to your parameters to better model the loss geometry, but they do not affect the loss you want to minimize (ignoring weight decay, which *does* transform unconstrained->constrained).
if we ignore the fact that Muon is doing Adam on some parameters and just focus on the spectral update (that's what you compute with Newton-Schulz), then it's a special case of Scion (which means you constrain the update to be in the spectral ball, blue in the picture).
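For the curious, here's a toy sketch of the Newton-Schulz idea (my own, using the simplest cubic iteration; Muon itself uses tuned quintic coefficients): it approximates the orthogonal polar factor UV^T of the gradient matrix without an SVD, and UV^T is exactly the spectral-ball LMO direction.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    # Approximate the orthogonal polar factor U V^T of G without an SVD.
    # Simplest cubic iteration: X <- 1.5 X - 0.5 X X^T X.
    X = G / np.linalg.norm(G)  # Frobenius normalization puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

# build a test matrix with known polar factor U @ V.T
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([1.0, 0.8, 0.6]) @ V.T
O = newton_schulz_orthogonalize(G)
```

Each iteration pushes every singular value toward 1 while leaving the singular vectors alone, so the output is (approximately) orthogonal.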
I heard that it's easier to get an h100 on Jean Zay than an a100, kind of funny. The hour multiplier for consumption (i.e. one h100 hour costs 4 credits) should take into account demand.
Come check out our #ICCV2025 poster for "Multi-modal Identity Extraction" at (Exhibit Hall I #73).
www.arxiv.org/abs/2508.09628
more evidence that frank-wolfe is all you need
you can improve your collaborators' writing clarity by being too dumb to fill in the gaps of what they've written, and arguing it must be wrong until they write it clearly enough that even you can understand it.
Not all DC algorithms, I should say, but CCCP is equivalent to Frank-Wolfe: proceedings.neurips.cc/paper_files/...
Yeah, in this case it does change the step size (and therefore the dynamics), even if one assumption implies the other (this is what my collaborators told me when we were first writing our paper). I look forward to learning more about what these guys have done and how much of a difference it makes.
I started to read this paper arxiv.org/abs/2510.17503 and thought, huh, the analysis is so much like Frank-Wolfe; then I remembered that Frank-Wolfe and DC algorithms are dual. Probably a Frank-Wolfe god like Jaggi knows that, but it's not mentioned in the paper; I must be missing something simple.
In our (L0, L1)-smooth work I kept lamenting that (L0, L1)-smoothness on a compact set (like in Frank-Wolfe) implies L-smoothness, so it's kind of a pointless assumption. But if you did the math to derive the short step, it would give a new, slightly tweaked step size. These guys did exactly that.
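For reference, the classic short step under L-smoothness, obtained by minimizing the quadratic upper bound along the segment to the Frank-Wolfe vertex s (the (L0, L1) version replaces L with a gradient-norm-dependent constant, which I won't try to reproduce here):

```latex
f(x + \gamma (s - x)) \le f(x) + \gamma \langle \nabla f(x),\, s - x \rangle + \frac{L \gamma^2}{2} \|s - x\|^2,
\qquad
\gamma^\star = \min\left\{ 1,\ \frac{\langle \nabla f(x),\, x - s \rangle}{L \,\|s - x\|^2} \right\}.
```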
Have you ever written a paper, and you see a small variation you could easily cover with your analysis etc but you don't do it? But you know if someone else did it right after, you would be upset you didn't include it? It happened to me again today! arxiv.org/abs/2510.16468
Frankenstein by Shelley
reminds me of "everyone steals, but i have taste!"
Abbas Khademi, Antonio Silveti-Falls
Adaptive Conditional Gradient Descent
https://arxiv.org/abs/2510.11440
Straight to the top of the "to read" list: arxiv.org/pdf/2510.09034
Now accepted at #NeurIPS2025 :)
In conditional gradient sliding you use the conditional gradient algorithm to "chase" the projected Nesterov algorithm: instead of computing the projection, you do some conditional gradient steps to approximate it. I wonder if you can do the same with FISTA / the accelerated proximal point algorithm?
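A toy sketch (mine, not from the CGS paper) of that inner idea: approximate the projection onto the probability simplex by running plain Frank-Wolfe steps on the projection objective, instead of solving it exactly.

```python
import numpy as np

def approx_projection_simplex(y, steps=200):
    # Approximate argmin_z 0.5 * ||z - y||^2 over the probability simplex
    # with Frank-Wolfe: the simplex LMO just picks the best vertex e_i.
    n = len(y)
    z = np.ones(n) / n  # start at the simplex center
    for k in range(steps):
        grad = z - y
        i = np.argmin(grad)            # LMO: vertex minimizing <grad, e_i>
        s = np.zeros(n)
        s[i] = 1.0
        gamma = 2.0 / (k + 2.0)        # standard FW step size
        z = (1 - gamma) * z + gamma * s
    return z

y = np.array([0.2, 0.9, -0.1])
z = approx_projection_simplex(y)  # stays feasible at every step, approaches proj(y)
```

The appeal is that every iterate is feasible and each inner step is a cheap linear minimization, which is the whole point of sliding: you trade exact projections for a few LMO calls.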
Your wish is granted (by Sebastian Pokutta) www.pokutta.com/blog/littles...
nerd sniped by the Bayesian learning rule again and still unsatisfied... ok, so you can explain a lot of DL optimization algorithms with certain approximations of various posteriors, but that's kind of kicking the can down the road: the question becomes, why those approximations instead of others?
My paper on Generalized Gradient Norm Clipping & Non-Euclidean (L0, L1)-Smoothness (together with collaborators from EPFL) was accepted as an oral at NeurIPS! We extend the theory for our Scion algorithm to include gradient clipping. Read about it here arxiv.org/abs/2506.01913
Don’t most people use the word increasing in everyday life to mean strictly increasing? If your boss said your salary was increasing next year and then it stayed the same, wouldn’t you object to the use of increasing?