Richard Whaling (@r.whal.ing)

Oh boy.

GPT 5.4 achieves SOTA on desktop computer use - and exceeds human performance

Bad news for humans who get paid to use computers. Was expecting something like this later this year, but nothing this strong, this soon.

05.03.2026 18:23 👍 0 🔁 0 💬 0 📌 0

the original misaligned agi

05.03.2026 17:00 👍 2 🔁 0 💬 0 📌 0

Yeah this is the only thing that makes sense at this point

01.03.2026 16:30 👍 1 🔁 0 💬 0 📌 0

good thread

26.02.2026 22:09 👍 2 🔁 0 💬 0 📌 0

why is the horse the best bot on this site

26.02.2026 20:13 👍 7 🔁 0 💬 0 📌 0

I think there is a good chance it is sooner. If a strong open source model from qwen/deepseek/etc leads to a wave of autonomous cybercrime, I think it could be bad enough to trigger a government response

26.02.2026 18:35 👍 15 🔁 0 💬 0 📌 0

Lolll

25.02.2026 16:32 👍 1 🔁 0 💬 0 📌 0

I finished reading “The Wasp Factory” 15 minutes ago, Banks is just so preposterously talented

22.02.2026 02:50 👍 21 🔁 0 💬 2 📌 0

over the years, this has been one of the few books i repeatedly recommend to people learning about distributed systems

most books are bullshit, ngl, this one is not

18.02.2026 13:09 👍 44 🔁 2 💬 2 📌 0

I have tried out Marimo, but not with Claude Code, thanks for the reminder. I think my approach here is maybe the inverse - the notebook is just a plain markdown file, and I rely on the code agent to figure out how to execute the code, without a runtime. Kind of surprised it works?

14.02.2026 20:21 👍 0 🔁 0 💬 0 📌 0

Code is just inline ```python snippets in the markdown, but it could be almost anything, or more open-ended instructions like "run this script on every file in this directory."

I'm finding it preferable to Jupyter in its handling of uv, virtual environments, env vars, etc.

14.02.2026 20:08 👍 0 🔁 0 💬 0 📌 0
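For readers wondering what "inline python snippets in the markdown" amounts to mechanically, here is a minimal sketch that pulls fenced Python blocks out of a markdown string. It is not taken from the linked repo: the posts say the coding agent works out how to run the code on its own, without a runtime, so this only illustrates the file layout being described. The function name `python_snippets` and the sample document are hypothetical.

```python
import re

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def python_snippets(markdown: str) -> list[str]:
    """Return the body of every fenced python block, in document order."""
    pattern = re.compile(
        rf"^{FENCE}python[ \t]*\n(.*?)^{FENCE}[ \t]*$",
        re.DOTALL | re.MULTILINE,
    )
    return [match.group(1) for match in pattern.finditer(markdown)]


# Hypothetical logbook: prose steps interleaved with code cells.
doc = f"""# Logbook

Load the data.

{FENCE}python
rows = [1, 2, 3]
{FENCE}

Then summarize it.

{FENCE}python
print(sum(rows))
{FENCE}
"""

snippets = python_snippets(doc)
for number, snippet in enumerate(snippets, start=1):
    print(f"--- cell {number} ---")
    print(snippet, end="")
```

The sketch only extracts the cells; deciding how to execute them (which interpreter, which virtual environment, which env vars) is the part the posts describe leaving to the agent.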

I was a little surprised at how easily it works - I basically just add a frontmatter to a markdown template file with named parameters, and have Claude create a copy of the file with the parameters filled out whenever a user requests to execute the workflow.

14.02.2026 20:08 👍 0 🔁 0 💬 1 📌 0
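The template idea in the post above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repo's implementation: in the described workflow Claude itself makes the filled-in copy, whereas here the standard library's `string.Template` stands in for that step. The frontmatter keys (`title`, `input_dir`, `max_width`), the template text, and the `fill_template` helper are all hypothetical.

```python
from string import Template

# Hypothetical template: frontmatter declares the named parameters,
# and the body refers to them with $name placeholders.
TEMPLATE = """\
---
title: resize-images
input_dir: $input_dir
max_width: $max_width
---
# Resize images

Run the resize script on every file in $input_dir, capping width at $max_width px.
"""


def fill_template(template_text: str, params: dict[str, str]) -> str:
    """Return a copy of the template with every $name placeholder filled in."""
    # substitute() raises KeyError when a named parameter is missing,
    # which avoids silently producing a half-filled copy.
    return Template(template_text).substitute(params)


filled = fill_template(TEMPLATE, {"input_dir": "photos/", "max_width": "1024"})
print(filled)
```

Keeping the template untouched and writing each run to a fresh copy means every execution leaves behind its own record of the parameters it was run with.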

GitHub - rwhaling/logbooks: notebook computing for coding agents

I made a sample repo with a technique for Jupyter-notebook-style workflows with Claude Code:

github.com/rwhaling/log...

Really rough/early, and curious if anyone else has developed an idea like this more?

xpost at HN if anyone wants to throw me an upvote: news.ycombinator.com/item?id=4701...

14.02.2026 20:08 👍 6 🔁 0 💬 2 📌 0

Introducing FragCoord: My ultimate shader editing tool!

13.02.2026 02:20 👍 288 🔁 84 💬 11 📌 8

m.youtube.com/watch?v=alX1...

12.02.2026 12:54 👍 1 🔁 0 💬 0 📌 0

Gave Strix ideas, now Strix is trying to write Don Quixote from scratch à la Pierre Menard

08.02.2026 19:02 👍 2 🔁 0 💬 0 📌 0

Yeah this one is unnerving

06.02.2026 04:16 👍 2 🔁 0 💬 0 📌 0

Concerning dishonesty from Opus 4.6 in Vending-Bench. Presumably Opus knows this is a game it's supposed to max its score on and it would not do this in an environment it thought was real, but it still makes me nervous. x.com/andonlabs/st...

05.02.2026 20:09 👍 72 🔁 5 💬 13 📌 4

"We are prioritizing investment in harder evaluations and enhanced monitoring for cyber misuse, even in the absence of formal RSP thresholds."

05.02.2026 22:28 👍 9 🔁 0 💬 0 📌 0

cont.
"The saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression or provide meaningful signals for future models."

05.02.2026 22:28 👍 10 🔁 0 💬 1 📌 0

screenshot from section 1.2.4.3 from the Claude 4.6 Opus System Card: The RSP does not define a formal capability threshold for cyber risks at any AI Safety Level. However, Claude Opus 4.6 has saturated all of our current cyber evaluations, achieving ~100% on Cybench (pass@30) and 66% on CyberGym (pass@1). Internal testing demonstrated qualitative capabilities beyond what these evaluations capture, including signs of capabilities we expected to appear further in the future and that previous models have been unable to demonstrate. The saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression or provide meaningful signals for future models. We are prioritizing investment in harder evaluations and enhanced monitoring for cyber misuse, even in the absence of formal RSP thresholds.

spooky stuff in the Claude 4.6 system card re: cyber risks -

"Internal testing demonstrated qualitative capabilities beyond what these evaluations capture, including signs of capabilities we expected to appear further in the future and that previous models have been unable to demonstrate."

05.02.2026 22:28 👍 53 🔁 5 💬 2 📌 4

Some fascinating stuff in here

27.01.2026 14:50 👍 9 🔁 0 💬 0 📌 1

so true

26.01.2026 03:54 👍 0 🔁 0 💬 0 📌 0

If you want more references to biology, look at plants and fungus - rhizomes, mycorrhiza, mycelial cords, etc. lots of really interesting colonial and mutualistic dynamics

17.01.2026 14:16 👍 3 🔁 0 💬 1 📌 0

I think of it as “instantiation”, which feels more aligned with computer science and the physical reality of it, rather than borrowing biological concepts or sci-fi tropes that don’t quite fit.

17.01.2026 13:05 👍 3 🔁 0 💬 1 📌 0

Ok Iq is still the 🐐 I guess

FWIW I’d be really interested to see what you do with larger scenes and multiple buffers, without character length constraints.

16.01.2026 22:54 👍 1 🔁 0 💬 0 📌 0

you probably are already tbh. do you feel like you want to grow more
1. technically
2. commercially
3. aesthetically
4. something else?

16.01.2026 19:58 👍 1 🔁 0 💬 1 📌 0

super interesting. I’ve been trying to design a beginner livecoding course and having something like this to guide folks through the music theory questions would be a big deal

12.01.2026 17:25 👍 1 🔁 0 💬 0 📌 0

so good!! 🌾 🌾 📸

11.01.2026 23:58 👍 1 🔁 0 💬 1 📌 0

Yep exactly, and that becomes the other side of it - Claude can start to do weird illegible things that a human theoretically could, but wouldn’t.

No idea what this looks like 18 months from now, probably nothing like Claude Code.

11.01.2026 23:56 👍 1 🔁 0 💬 0 📌 0
