
Computational Cosmetologist

@dferrer

ML Scientist (derogatory), Ex-Cosmologist, Post Large Scale Structuralist I worked on the Great Problems in AI: Useless Facts about Dark Energy, the difference between a bed and a sofa, and now facilitating bank-on-bank violence. Frequentists DNI.

476
Followers
220
Following
805
Posts
17.07.2024
Joined

Latest posts by Computational Cosmetologist @dferrer

Social media has been a great source of random papers I’d disregarded/underestimated when they came out. It’s not the main source of anything for me, but it fills a gap nicely

05.03.2026 12:39 👍 1 🔁 0 💬 0 📌 0

Oh man, it's basically viral advertising too. Like, imagine the indignity of being blown up by a home-made bomb that first tells you to "Revolutionize your B2B impact with Boom.ly". Man-made horrors beyond our comprehension.

04.03.2026 21:49 👍 2 🔁 0 💬 0 📌 0

And you can run Qwen 3.5 yourself! (With the right hardware). They just give the weights away! You can do what you want. No billionaires need be party to it.

04.03.2026 19:22 👍 1 🔁 0 💬 0 📌 0

It may be generally true that “the methods that end up working will be among the more general of the ones you tried”, but “general, scalable methods are always better than more specific ones” is not.

Fin.

04.03.2026 17:59 👍 0 🔁 0 💬 0 📌 0

In the domain where the NFLT is applicable, for instance, performance on a held-out test set is *completely uncorrelated* with overall performance. These are not the sort of problems anyone ever actually encounters. They are the equivalent of trying to compress noise.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0

At the extreme end, Bitter Lesson-ing can veer into NFLT-like “all search is the same” appeals which should be rejected out of hand. If this were true, progress would be impossible.

04.03.2026 17:59 👍 0 🔁 0 💬 1 📌 0

Much of the essential structure of “scalable generalist methods” was developed *by imitating human thought processes*. We’ve just decided it is good, helpful structure instead of the dumb, wasteful kind you shouldn’t use. “Avoid *bad* structure” is not a useful statement.

04.03.2026 17:59 👍 0 🔁 0 💬 1 📌 0

An objection I’ve heard to this is that there’s a difference between “essential structure for a method to work” and “unnecessary extra structure that imitates our thought processes”. The distinction between these, though, is only knowable in retrospect.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0

In all three cases, the maximalist reading of The Bitter Lesson would tell us that with enough scaling, these methods would have beaten what replaced them. That never happened. I guess it still could in the long run, but to paraphrase Keynes, in the long run we’re all dead.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0

Finally, a highly specific example is positional embeddings. Early LLMs learned these from scratch, with no inductive bias. Again, this worked at small scales but was utterly dominated by theoretically motivated, domain-specialized methods like RoPE. No one tries to learn shift covariance anymore.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0
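A minimal sketch of the idea behind RoPE (a toy 2-D version for illustration; real RoPE applies many such rotations across paired feature dimensions, and the base frequency here is chosen arbitrarily):

```python
import numpy as np

# Rotate q and k by position-dependent angles. Because the rotations
# compose, the attention score (R_m q) . (R_n k) = q^T R_{n-m} k depends
# only on the relative offset m - n, not the absolute positions.
def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
freq = 0.01  # base frequency (illustrative choice)

def score(m, n):
    return (rot(m * freq) @ q) @ (rot(n * freq) @ k)

print(np.isclose(score(5, 3), score(105, 103)))  # shift both by 100: same score
```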

Our next example is GANs. GANs were the genetic algorithms of the 2010s. They promised to solve virtually any problem—you didn’t even need a domain-specific loss. Unfortunately (like GAs) they only performed well on the simplest problems. ImageGen only took off when we left them behind.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0

To be unfair, if the maximalist reading of The Bitter Lesson were true, genetic algorithms would have found the Transformer in 2003.

I still have people tell me these will make a reappearance any day now, though far less often than a decade ago.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0

Genetic Algorithms are *the* generalist, scalable method. The NFLT bros once loved to talk about how soon these would replace all optimization. We even had a working example in biology. They were right—if you have time for a trillion or so iterations you *can* make them work. You don’t, though.

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0
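The "you *can* make them work, given enough iterations" point can be sketched with a toy mutate-and-select loop (really a (mu+lambda)-style evolution strategy with no crossover, on a deliberately trivial objective):

```python
import numpy as np

# Maximize f(x) = -(x - 3)^2 by mutation + selection alone. It does
# converge -- it just needs many generations even in one dimension,
# which is the whole complaint about iteration budgets.
rng = np.random.default_rng(0)
f = lambda x: -(x - 3.0) ** 2

pop = rng.normal(size=32)                            # random initial population
for _ in range(200):
    children = pop + rng.normal(scale=0.1, size=32)  # mutate
    both = np.concatenate([pop, children])
    pop = both[np.argsort(f(both))[-32:]]            # keep the fittest 32
print(pop.mean())  # converges near 3.0
```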

First, while there are many cases of general, scalable methods beating domain specialized ones, there are plenty of cases of *too-general methods that didn’t work because they were too general*. Let’s look at a few in decreasing level of ambition:

04.03.2026 17:59 👍 1 🔁 0 💬 1 📌 0
The “our blessed homeland” meme, where the same things people often argue against on “bitter lesson” grounds are contrasted with the positive framing we give those same actions for methods that *work*:

Our Essential Structure vs. Their Bitter Lesson
Our visionary equivariance vs. their constricting inductive bias
Our cunning generality vs. their infeasibly vast solution space
Our clever synthetic data vs. their desperate data augmentation
Our exponential speedup vs. their losing battle with Moore’s Law


“The Bitter Lesson” is a fine little essay. Anyone in software/ML/AI should read it. But I see a lot of people run with a maximalist overstatement of it, essentially: “structure and domain knowledge are always a mistake”. This is historically dubious and theoretically ungrounded. A thread:

04.03.2026 17:59 👍 5 🔁 1 💬 1 📌 0

Part of the reason I never open source (or push to release commercially) anything for agents is that I do not want the misery of owning it

04.03.2026 13:36 👍 2 🔁 0 💬 0 📌 0

i'm talking about this to be clear

this has all been an obvious idea for literally years imho

bsky.app/profile/davi...

04.03.2026 04:55 👍 42 🔁 1 💬 2 📌 0

Spent months building something like this out early last year because it was obviously a good idea—but zero support in the open ecosystem. I hope this finally takes off so I don’t have to keep supporting it internally.

04.03.2026 11:51 👍 1 🔁 0 💬 0 📌 0

Meanwhile optimal transport just naturally handles all three regimes (discrete / mutual support / disjoint support) in the same way.

04.03.2026 02:52 👍 1 🔁 0 💬 0 📌 0

For me the thing that settles it is that the distance between point masses is just their distance in the underlying metric. I have seen so many people try to hack KL into working on distributions with disjoint support (doing crazy things like Gaussian smoothing, or adding a difference-of-CoM term to the loss)

04.03.2026 02:52 👍 1 🔁 0 💬 1 📌 0

For me (and this is probably just different intuition), it is deeply weird to have two distributions with different supports always be at the same distance from each other. Practically, it also means distances are dominated by low-mass regions even when the support is the same.

04.03.2026 02:45 👍 1 🔁 0 💬 1 📌 0
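The "dominated by low-mass regions" point shows up even in a two-bin toy example (a quick numeric illustration, not from the original thread):

```python
import numpy as np

# KL(p||q) = sum p log(p/q): a bin where q has tiny mass dominates the
# divergence, even though p and q share exactly the same support.
kl = lambda p, q: float(np.sum(p * np.log(p / q)))

p  = np.array([0.5, 0.5])
q1 = np.array([0.6, 0.4])       # mild disagreement everywhere
q2 = np.array([0.999, 0.001])   # nearly all mass piled into one bin
print(kl(p, q1))  # ~0.02
print(kl(p, q2))  # ~2.76, almost entirely from the low-mass bin
```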

W2 for normal distributions is just the Euclidean distance on (mean, std), so these should be different

04.03.2026 02:41 👍 1 🔁 0 💬 1 📌 0
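The closed form for 1-D Gaussians can be checked numerically against the quantile-coupling formula for W2 (a sketch assuming SciPy is available):

```python
import numpy as np
from scipy.stats import norm

# Closed form for 1-D Gaussians: W2 = sqrt((m1-m2)^2 + (s1-s2)^2),
# i.e. the Euclidean distance between the (mean, std) pairs.
def w2_gaussian(m1, s1, m2, s2):
    return float(np.hypot(m1 - m2, s1 - s2))

# Numerical check via the 1-D quantile-coupling formula:
#   W2^2 = integral_0^1 (F1^{-1}(q) - F2^{-1}(q))^2 dq
q = (np.arange(200_000) + 0.5) / 200_000   # midpoint grid on (0, 1)
m1, s1, m2, s2 = 0.0, 1.0, 3.0, 2.0
w2_num = float(np.sqrt(np.mean((norm.ppf(q, m1, s1) - norm.ppf(q, m2, s2)) ** 2)))
print(w2_gaussian(m1, s1, m2, s2), w2_num)  # both ~= sqrt(10) ~= 3.162
```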

Talk to your ML students about Information Geometry... before someone else does.

04.03.2026 02:18 👍 2 🔁 0 💬 0 📌 0

Friends don't let friends use the KL Divergence for distributions over spaces with a natural metric.

04.03.2026 02:14 👍 2 🔁 0 💬 1 📌 0
[video]

And here is the Wasserstein metric

04.03.2026 02:06 👍 0 🔁 0 💬 1 📌 0
[video]

So this is a geodesic under the Fisher metric (the Riemannian-metric version of the KL divergence)

04.03.2026 02:06 👍 1 🔁 0 💬 1 📌 0

The KL geodesic between distributions with widely separated means and the same shape first blows up the variance, then slowly shifts the mean. The Wasserstein geodesic keeps the same shape but moves the mean in a straight path.

I’ll try to dig up an animation of this I made. It’s striking.

04.03.2026 01:56 👍 2 🔁 0 💬 1 📌 0
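In 1-D the Wasserstein geodesic can be sketched directly by linearly interpolating quantile functions (displacement interpolation) — a quick illustration, assuming SciPy, of the "mean moves, shape stays" behavior described above:

```python
import numpy as np
from scipy.stats import norm

# 1-D W2 geodesic: interpolate the quantile functions. Between two
# Gaussians of equal width, the mean moves linearly while the shape
# (std) stays fixed -- no blown-up intermediate variance.
q = (np.arange(100_000) + 0.5) / 100_000   # midpoint grid on (0, 1)
f0 = norm.ppf(q, 0.0, 1.0)                 # quantiles of N(0, 1)
f1 = norm.ppf(q, 10.0, 1.0)                # quantiles of N(10, 1)
for t in [0.0, 0.5, 1.0]:
    ft = (1 - t) * f0 + t * f1             # geodesic at time t
    print(t, round(ft.mean(), 3), round(ft.std(), 3))  # mean = 10t, std ~ 1
```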

One big reason to hate it is that the ultra weak topology makes far more intuitive sense on distributions over metric spaces than the weak topology.

A probability mass at x=1 should be a neighbor of a mass at x=1 + epsilon. That the FIM / KL divergence doesn’t have this is a crime.

04.03.2026 01:51 👍 3 🔁 0 💬 1 📌 1
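The point-mass neighborhood property is easy to demo with SciPy's empirical (1-)Wasserstein distance (a sketch; note this is W1, not W2, but the point-mass behavior is the same):

```python
from scipy.stats import wasserstein_distance

# A point mass at x=1 vs a point mass at x=1+eps: KL is infinite for any
# eps > 0 (disjoint support), but the Wasserstein distance is just eps --
# the distance in the underlying metric, shrinking as eps -> 0.
for eps in [1.0, 0.1, 0.001]:
    print(eps, wasserstein_distance([1.0], [1.0 + eps]))  # distance == eps
```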

If the answer is that we expect leaders to care about their own lives more than those of their people, and thus react more aggressively to deter decapitation—I understand why *leaders* want that norm. We don’t have to agree, though.

03.03.2026 19:00 👍 1 🔁 0 💬 0 📌 0

But this is the same escalation process as any other military action. Why should leadership be out of bounds? Assuming we’ve already decided war is justified, of course.

03.03.2026 18:56 👍 3 🔁 0 💬 2 📌 0