Alexander Doria (@dorialexander)

yes yes i know, it's just for synth pipelines i could use X thousand different bayesian estimates advancing in parallel.

07.03.2026 22:05 👍 1 🔁 0 💬 0 📌 0

hmm. even with like launching many processes in parallel? could definitely have use cases for this.

07.03.2026 21:58 👍 0 🔁 0 💬 1 📌 0

Actually very much looking to revisit BRMS & co. It was dope (and should compile much better on modern gpus)

07.03.2026 21:48 👍 1 🔁 0 💬 1 📌 0

Forget about this, Bayesian is hot again.

07.03.2026 21:47 👍 1 🔁 0 💬 1 📌 0

In any case we're clearly heading toward training without text but still with structured data. Right now I'm building synthetic environments exclusively from Wikidata.

07.03.2026 20:46 👍 0 🔁 0 💬 0 📌 0

There may be a follow-up coming soon.

07.03.2026 16:39 👍 1 🔁 0 💬 0 📌 0

No worries. The audience is mostly international now…

07.03.2026 16:33 👍 0 🔁 0 💬 1 📌 0

(I admit I'm hesitant to bring the subject up with rights holders; we're still being left well alone for now)

07.03.2026 16:31 👍 1 🔁 0 💬 1 📌 0

On the use of synthetic data for training? Well, quite a few things are cited in my post (including virtually every model report that is reasonably recent and at least somewhat open on the data question). vintagedata.org/blog/posts/s...

07.03.2026 16:30 👍 0 🔁 0 💬 2 📌 0

Synthetic Pretraining | Vintage Data Old data, new models

More of a theory sketch than a paper but it seems to hold vintagedata.org/blog/posts/s...

(I might have something more soon)

07.03.2026 15:04 👍 8 🔁 0 💬 2 📌 0

additional nail in the coffin of model collapse: better results so far on a model retrained on its own synthetic traces.

07.03.2026 13:17 👍 61 🔁 6 💬 2 📌 0

Yeah well, I launch things and check whether it works and whether the data makes sense. We've clearly crossed a threshold these past few months…

07.03.2026 11:06 👍 0 🔁 0 💬 0 📌 0

(Claude Code now works very well for generating the inference script. I've almost stopped programming directly this month)

07.03.2026 10:57 👍 2 🔁 0 💬 1 📌 0

Two OCR models on HuggingFace :) Open/open source, so it runs locally, though in practice it may be simpler to run on Colab.

07.03.2026 10:55 👍 2 🔁 0 💬 1 📌 0

Maybe overkill, but dots ocr (I'm currently processing all of HAL with it) or Lighton-OCR. Very reliable, and also handles the whole layout side.

07.03.2026 10:43 👍 1 🔁 0 💬 1 📌 0

Never managed to read it either. And same feeling: not much life in there.

05.03.2026 21:17 👍 0 🔁 0 💬 1 📌 0

oh yes, obviously, i can make this now

05.03.2026 00:05 👍 18 🔁 1 💬 1 📌 0

who's talking about cleanly?

04.03.2026 22:10 👍 4 🔁 0 💬 1 📌 0

Well 10 years of teaching it… Likely last time.

04.03.2026 22:09 👍 6 🔁 0 💬 1 📌 0

I guess Donald Knuth must have thought of that :)

04.03.2026 21:02 👍 0 🔁 0 💬 2 📌 0

just realized that jupyter is probably dead as a concept. it's all md+scripts now.

04.03.2026 20:35 👍 82 🔁 8 💬 9 📌 7

more seriously: i still think "computation" is also happening internally (just in a smooth/transient way, not that dissimilar to actual math search prior to formal verification)

04.03.2026 00:39 👍 8 🔁 0 💬 1 📌 0

I'm afraid this is anthropomorphizing. The proof was there all along in future training data.

03.03.2026 23:47 👍 50 🔁 3 💬 2 📌 1

Nothing to see, just very powerful pattern matching. www-cs-faculty.stanford.edu/~knuth/paper...

03.03.2026 23:36 👍 216 🔁 44 💬 11 📌 20

actually, yes.

03.03.2026 07:25 👍 1 🔁 0 💬 0 📌 0

Not sure about the US, but in Europe they started very early on (even Q3 2023) with their positioning on safety/alignment and avoiding the mess OpenAI got into at the same time (GDPR blocks, etc.)

03.03.2026 00:29 👍 1 🔁 0 💬 0 📌 0

(Our next release will actually be personas)

02.03.2026 22:59 👍 6 🔁 1 💬 1 📌 0

Would also open up the much more interesting question of how to design and tune personas. I’m currently switching to agentic model training and simulated personas are everywhere, one of the absolute core original seeds.

02.03.2026 22:58 👍 18 🔁 1 💬 1 📌 0

Oh been part-time there for a while now. Always good to have a platform plan b.

02.03.2026 20:20 👍 3 🔁 0 💬 1 📌 0

Models should design, models should populate, models should compile.

02.03.2026 17:46 👍 14 🔁 1 💬 0 📌 0
