Oh boy.
GPT 5.4 achieves SOTA on desktop computer use - and exceeds human performance
Bad news for humans who get paid to use computers. Was expecting something like this later this year, but nothing this strong, this soon.
@r.whal.ing
I hack things. Data, ML, music, etc. AI governance geek. Founder of semistructured.ai, speaking in a personal capacity only here. Likes are bookmarks, not endorsements. music/art projects on IG, @r__whaling
the original misaligned agi
Yeah this is the only thing that makes sense at this point
good thread
why is the horse the best bot on this site
I think there is a good chance it is sooner. If a strong open source model from qwen/deepseek/etc leads to a wave of autonomous cybercrime, I think it could be bad enough to trigger a government response
Lolll
I finished reading “The Wasp Factory” 15 minutes ago, Banks is just so preposterously talented
over the years, this has been one of the few books i repeatedly recommend to people learning about distributed systems
most books are bullshit, ngl, this one is not
I have tried out Marimo, but not with Claude Code, thanks for the reminder. I think my approach here is maybe the inverse - the notebook is just a plain markdown file, and I rely on the code agent to figure out how to execute the code, without a runtime. Kind of surprised it works?
Code is just inline ```python snippets in the markdown, but it could be almost anything, or more open-ended instructions like "run this script on every file in this directory."
I'm finding it preferable to Jupyter in its handling of uv, virtual environments, env vars, etc.
I was a little surprised at how easily it works - I basically just add a frontmatter to a markdown template file with named parameters, and have Claude create a copy of the file with the parameters filled out whenever a user requests to execute the workflow.
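A rough sketch of that parameter-filling step in plain Python, for anyone who wants the idea without the agent in the loop. This is not what the repo does (there, Claude makes the copy itself); the function name `fill_frontmatter`, the `name: value` frontmatter layout, and the parameter names `input_dir` and `limit` are all assumptions made up for illustration.

```python
def fill_frontmatter(template_text: str, params: dict[str, str]) -> str:
    """Return a copy of a markdown template with frontmatter values filled in.

    Assumes (hypothetically) that the template starts with a block delimited
    by '---' lines, containing one 'name: value' pair per line.
    """
    lines = template_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("template has no frontmatter block")
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        raise ValueError("frontmatter block is not closed") from None

    filled = []
    for line in lines[1:end]:
        key, sep, _ = line.partition(":")
        name = key.strip()
        if sep and name in params:
            # Overwrite the value for any parameter the caller supplied.
            filled.append(f"{name}: {params[name]}")
        else:
            # Leave unknown or unsupplied parameters untouched.
            filled.append(line)

    return "\n".join(["---", *filled, "---", *lines[end + 1:]]) + "\n"


# Hypothetical template: two named parameters, then free-form instructions.
TEMPLATE = (
    "---\n"
    "input_dir:\n"
    "limit: 10\n"
    "---\n"
    "# Workflow\n"
    "Run the script on every file in input_dir.\n"
)

result = fill_frontmatter(TEMPLATE, {"input_dir": "data/raw"})
print(result)
```

The filled-out text would then be written to a new file, so the template stays reusable and each run leaves its own record.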
I made a sample repo with a technique for Jupyter-notebook-style workflows with Claude Code:
github.com/rwhaling/log...
Really rough/early, and curious if anyone else has developed an idea like this more?
xpost at HN if anyone wants to throw me an upvote: news.ycombinator.com/item?id=4701...
Introducing FragCoord: My ultimate shader editing tool!
m.youtube.com/watch?v=alX1...
Gave Strix ideas, now Strix is trying to write Don Quixote from scratch à la Pierre Menard
Yeah this one is unnerving
Concerning dishonesty from Opus 4.6 in Vending-Bench. Presumably Opus knows this is a game it's supposed to max its score on and it would not do this in an environment it thought was real, but it still makes me nervous. x.com/andonlabs/st...
"We are prioritizing investment in harder evaluations and enhanced monitoring for cyber misuse, even in the absence of formal RSP thresholds."
cont.
"The saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression or provide meaningful signals for future models."
screenshot from section 1.2.4.3 from the Claude 4.6 Opus System Card: The RSP does not define a formal capability threshold for cyber risks at any AI Safety Level. However, Claude Opus 4.6 has saturated all of our current cyber evaluations, achieving ~100% on Cybench (pass@30) and 66% on CyberGym (pass@1). Internal testing demonstrated qualitative capabilities beyond what these evaluations capture, including signs of capabilities we expected to appear further in the future and that previous models have been unable to demonstrate. The saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression or provide meaningful signals for future models. We are prioritizing investment in harder evaluations and enhanced monitoring for cyber misuse, even in the absence of formal RSP thresholds.
spooky stuff in the Claude 4.6 system card re: cyber risks -
"Internal testing demonstrated qualitative capabilities beyond what these evaluations capture, including signs of capabilities we expected to appear further in the future and that previous models have been unable to demonstrate."
Some fascinating stuff in here
so true
If you want more references to biology, look at plants and fungus - rhizomes, mycorrhiza, mycelial cords, etc. lots of really interesting colonial and mutualistic dynamics
I think of it as “instantiation”, which feels more aligned with computer science and the physical reality of it, rather than borrowing biological concepts or sci-fi tropes that don’t quite fit.
Ok Iq is still the π I guess
FWIW I’d be really interested to see what you do with larger scenes and multiple buffers, without character length constraints.
you probably are already tbh. do you feel like you want to grow more
1. technically
2. commercially
3. aesthetically
4. something else?
super interesting. I’ve been trying to design a beginner livecoding course and having something like this to guide folks through the music theory questions would be a big deal
so good!!
Yep exactly, and that becomes the other side of it - Claude can start to do weird illegible things that a human theoretically could, but wouldn’t.
No idea what this looks like 18 months from now, probably nothing like Claude Code.