they should add reaction emojis to openreview
looks similar to Saclay this morning
in my experience, convex is rarely, if ever, used in day-to-day life. most people who aren't mathematicians seem unsure of the difference between concave and convex to begin with. not apples to apples imo, since increasing is used by everyone pretty regularly, with an agreed-upon meaning.
The point is that for some conferences (NeurIPS, ICML) reviews are published for rejected papers but not for withdrawn papers; I thought this might be the case. I see from your reasoning why it cannot be the explanation.
Does the conference use openreview? Maybe they are evading having the bad reviews published by withdrawing?
So when you're doing muon with weight decay to train nanoGPT you're using frank-wolfe to train a frank-wolfe machine
I missed this post but it is pure gold. www.colincornaby.me/2025/08/in-t...
Our results are for any algorithm that fits the stochastic conditional gradient framework, which includes Muon notably but also normalized SGD, sign SGD, and others (e.g., greedy coordinate descent, low-rank stuff).
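A minimal sketch (my own illustration, not code from the paper) of why those methods fit one template: each is the conditional gradient update with a different linear minimization oracle (LMO) over a norm ball. Sign SGD corresponds to the L-infinity ball, normalized SGD to the L2 ball.

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # argmin over the L-inf ball of <g, d>  ->  -radius * sign(g)  (sign SGD direction)
    return -radius * np.sign(g)

def lmo_l2(g, radius=1.0):
    # argmin over the L2 ball of <g, d>  ->  -radius * g / ||g||  (normalized SGD direction)
    return -radius * g / np.linalg.norm(g)

def conditional_gradient_step(x, g, lmo, lr=0.1):
    # generic update: move along the LMO direction for the (stochastic) gradient g
    return x + lr * lmo(g)

x = np.array([1.0, -2.0, 3.0])
g = np.array([0.5, -1.0, 2.0])
print(conditional_gradient_step(x, g, lmo_linf))  # sign-SGD-style step
print(conditional_gradient_step(x, g, lmo_l2))    # normalized-SGD-style step
```

Swapping in the spectral-ball LMO for matrix parameters gives the Muon-style update in the same template.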
Yep, none of this is affecting the loss - these regularizers are being added to the computation of the update to your parameters to better model the loss geometry, but they do not affect the loss you want to minimize (ignoring weight decay, which *does* transform unconstrained->constrained).
if we ignore the fact that Muon is doing Adam on some parameters and just focus on the spectral update (that's what you compute with Newton-Schulz), then it's a special case of Scion (which means you constrain the update to be in the spectral ball, blue in the picture).
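For the curious, here's a toy sketch of the Newton-Schulz idea (my own, using the simplest cubic iteration; Muon itself uses tuned quintic coefficients): it approximates the orthogonal polar factor UV^T of the gradient matrix without an SVD, and UV^T is exactly the spectral-ball LMO direction.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    # Approximate the orthogonal polar factor U V^T of G without an SVD.
    # Simplest cubic iteration: X <- 1.5 X - 0.5 X X^T X.
    X = G / np.linalg.norm(G)  # Frobenius normalization puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

# build a test matrix with known polar factor U @ V.T
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([1.0, 0.8, 0.6]) @ V.T
O = newton_schulz_orthogonalize(G)
```

Each iteration pushes every singular value toward 1 while leaving the singular vectors alone, so the output is (approximately) orthogonal.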
I heard that it's easier to get an h100 on Jean Zay than an a100, kind of funny. The hour multiplier for consumption (i.e. one h100 hour costs 4 credits) should take into account demand.
Come check out our #ICCV2025 poster for "Multi-modal Identity Extraction" at (Exhibit Hall I #73).
www.arxiv.org/abs/2508.09628
more evidence that frank-wolfe is all you need
you can improve your collaborators' writing clarity by being too dumb to fill in the gaps of what they've written, and arguing it must be wrong until they write it clearly enough that even you can understand it.
Not all DC algorithms, I should say, but CCCP is equivalent to Frank-Wolfe: proceedings.neurips.cc/paper_files/...
Yeah, in this case it does change the step size (and therefore the dynamics), even if one assumption implies the other (this is what my collaborators told me when we were first writing our paper). I look forward to learning more about what these guys have done and how much of a difference it makes.
I started to read this paper arxiv.org/abs/2510.17503 and thought, huh, the analysis is so much like Frank-Wolfe; then I remembered that Frank-Wolfe and DC algorithms are dual. Probably a Frank-Wolfe god like Jaggi knows that, but it's not mentioned in the paper; I must be missing something simple.
In our (L0, L1)-smooth work I kept lamenting that (L0, L1)-smoothness on a compact set (like in Frank-Wolfe) implies L-smoothness, so it's kind of a pointless assumption. But if you did the math to derive the short step, it would give a new, slightly tweaked step size. These guys did exactly that.
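For reference, the classic short step under L-smoothness, obtained by minimizing the quadratic upper bound along the segment to the Frank-Wolfe vertex s (the (L0, L1) version replaces L with a gradient-norm-dependent constant, which I won't try to reproduce here):

```latex
f(x + \gamma (s - x)) \le f(x) + \gamma \langle \nabla f(x),\, s - x \rangle + \frac{L \gamma^2}{2} \|s - x\|^2,
\qquad
\gamma^\star = \min\left\{ 1,\ \frac{\langle \nabla f(x),\, x - s \rangle}{L \,\|s - x\|^2} \right\}.
```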
Have you ever written a paper, and you see a small variation you could easily cover with your analysis etc but you don't do it? But you know if someone else did it right after, you would be upset you didn't include it? It happened to me again today! arxiv.org/abs/2510.16468
Frankenstein by Shelley
reminds me of "everyone steals, but i have taste!"
Abbas Khademi, Antonio Silveti-Falls
Adaptive Conditional Gradient Descent
https://arxiv.org/abs/2510.11440
Straight to the top of the "to read" list: arxiv.org/pdf/2510.09034
Now accepted at #NeurIPS2025 :)
In conditional gradient sliding you use the conditional gradient algorithm to "chase" the projected Nesterov algorithm: instead of computing the projection, you do some conditional gradient steps to approximate it. I wonder if you can do the same with FISTA / the accelerated proximal point algorithm?
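A toy sketch (mine, not from the CGS paper) of that inner idea: approximate the projection onto the probability simplex by running plain Frank-Wolfe steps on the projection objective, instead of solving it exactly.

```python
import numpy as np

def approx_projection_simplex(y, steps=200):
    # Approximate argmin_z 0.5 * ||z - y||^2 over the probability simplex
    # with Frank-Wolfe: the simplex LMO just picks the best vertex e_i.
    n = len(y)
    z = np.ones(n) / n  # start at the simplex center
    for k in range(steps):
        grad = z - y
        i = np.argmin(grad)            # LMO: vertex minimizing <grad, e_i>
        s = np.zeros(n)
        s[i] = 1.0
        gamma = 2.0 / (k + 2.0)        # standard FW step size
        z = (1 - gamma) * z + gamma * s
    return z

y = np.array([0.2, 0.9, -0.1])
z = approx_projection_simplex(y)  # stays feasible at every step, approaches proj(y)
```

The appeal is that every iterate is feasible and each inner step is a cheap linear minimization, which is the whole point of sliding: you trade exact projections for a few LMO calls.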
Your wish is granted (by Sebastian Pokutta) www.pokutta.com/blog/littles...
nerd sniped by the Bayesian learning rule again and still unsatisfied... ok, so you can explain a lot of DL optimization algorithms with certain approximations of various posteriors, but that's kind of kicking the can down the road: the question becomes, why those approximations instead of others?
My paper on Generalized Gradient Norm Clipping & Non-Euclidean (L0, L1)-Smoothness (together with collaborators from EPFL) was accepted as an oral at NeurIPS! We extend the theory for our Scion algorithm to include gradient clipping. Read about it here arxiv.org/abs/2506.01913
Don’t most people use the word increasing in everyday life to mean strictly increasing? If your boss said your salary was increasing next year and then it stayed the same, wouldn’t you object to the use of increasing?