The hacker-type who builds random software side projects for fun, without needing to justify their utility or ability to generate capital, is one of the oldest programmer stereotypes.
Many such builders still exist!
I've been calling it getting Lee Sedol'd, but Deep Blue is perhaps a better term for it. The AlphaGo documentary is the greatest depiction of this "Deep Blue" feeling I've seen, especially knowing where we are now.
My colleague analysed the system prompts for codex and claude and realised that the reason they feel different is deliberate product decisions in the prompt!
blog.nilenso.com/blog/2026/02...
It's also OpenAI-led and the reference schema is more-or-less identical to the current Responses API. It doesn't seem like Anthropic or Google have bought into this: they have competing formats.
The vendors that are bought in don't have fully compliant implementations yet.
I was hoping the OpenResponses API would be a meaningful step forward in dealing with the LLM API standardisation headaches, but right now the spec is really undercooked.
There are lots of inconsistencies/contradictions between the reference schemas and what the specification says!
www.openresponses.org
The problem it is solving makes sense (cross-vendor agent communication), but I don't understand why there's such a massive and detailed spec for an *anticipated* use case that hasn't properly materialised yet.
Castles in the sky energy.
a2a-protocol.org/latest/speci...
Is the A2A protocol completely useless? I don't know of anyone building enterprise multi-agent communication.
Why design such a thick protocol for a use case that does not yet exist in practice?
a2a-protocol.org/latest/
Feels like another SOAP/CORBA, etc.
bsky.app/profile/grac...
The other reason is that infrastructure noise and variations can affect benchmarks a lot, I wonder if that's the case with the SWE Bench Pro runs.
www.anthropic.com/engineering/...
The harness matters a lot. SWE Bench Pro uses SWE-Agent-Mini by default. OpenAI likely reports a result on their own harness.
Codex is a weird model that performs much worse in generic, minimal harnesses. That's likely why the tool shapes in Codex CLI are strange, like "apply_patch".
While SWE Bench Pro is a pretty good benchmark (especially compared to Verified), the leaderboard rankings clearly "look wrong". No one will agree that Claude 4 Sonnet is better than 5.2 Codex.
Really shows how insufficient public benchmarks are becoming at conveying model capabilities.
It's particularly strange that Anthropic won't report SWE-Bench Pro in their announcements. Their models have always done better on it than OpenAI (at least on the public dataset):
scale.com/leaderboard/...
I think it might just be that ~80% solved looks more impressive than ~50% solved.
the screenshot is from my post: blog.nilenso.com/blog/2025/09...
"Our tasks typically use environments that do not significantly change unless directly acted upon by the agent. In contrast, real tasks often occur in the context of a changing environment. […] Similarly, very few of our tasks are punishing of single mistakes. This is in part to reduce the expected cost of collecting human baselines."

This is not at all like the tasks I am doing. METR acknowledges the messiness of the real world. They have come up with a "messiness rating" for their tasks, and the mean messiness of their tasks is 3.2/16. By METR's definitions, the kind of software engineering work that I'm mostly exposed to would score at least around 7-8, given that software engineering projects are path-dependent, dynamic and without clear counterfactuals. I have worked on problems that get to around 13/16 levels of messiness.

"An increase in task messiness by 1 point reduces mean success rates by roughly 8.1%"

Extrapolating from METR's measured effect of messiness, GPT-5 would go from 70% to around 40% success rate for 2-hour tasks. This maps to my experienced reality.
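The extrapolation above is just linear arithmetic; here it is spelled out, assuming a linear messiness effect and taking 7.5 as the midpoint of my 7-8 estimate (all other numbers are from the post, not from METR's paper directly):

```python
# Back-of-the-envelope extrapolation. Assumption: the ~8.1 percentage
# point drop per messiness point applies linearly beyond METR's range.
metr_mean_messiness = 3.2    # mean messiness of METR's tasks (out of 16)
my_work_messiness = 7.5      # midpoint of my estimated 7-8 range
drop_per_point = 8.1         # percentage points of success rate per messiness point
baseline_success = 70.0      # GPT-5 success rate on 2-hour tasks (%)

extrapolated = baseline_success - drop_per_point * (my_work_messiness - metr_mean_messiness)
print(f"{extrapolated:.1f}%")  # lands in the ~35-40% ballpark
```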
The METR tasks are narrow (ie, "not messy") and not very numerous, so it's hard to generalise the automatability of software engineering from that alone.
It looks like 100% replacement for software engineering can happen, but probably not in the next 2 years, at least.
"Taking Jaggedness Seriously" by Helen Toner talks about this in some depth.
helentoner.substack.com/p/taking-jag...
screenshot: PART I: 2025 KEY INITIATIVES. Toxicity Filtering. Toxicity is a persistent challenge for all large-scale social apps. As communities grow, maintaining space for both friendly conversation and fierce disagreement requires intentional design choices. Our community doubled in size over the past year, and with that growth came tension: how to preserve healthy discourse while respecting genuine debate and diverse user preferences. Toxic and inflammatory discourse appears across all forms of social media, and almost universally, a small percentage of people contribute disproportionately to causing this problem. A tiny number of users can have an outsize impact on conversation quality and on people's willingness to participate. In 2023-2024, anti-social behavior such as harassment, trolling, and intolerance consistently ranked among the top complaints reported by our users. This content drives people away from forming connections, posting, or engaging, for fear of attacks and pile-ons.
screenshot: In October, we began experimenting with improving conversation quality, starting with replies. Rather than only reacting after users report abusive or toxic interactions, we launched an experiment to identify replies that are toxic, spammy, off-topic, or posted in bad faith, and reduce their visibility in the Bluesky app. This approach adds friction: most viewers casually scanning a conversation won't encounter the toxic or potentially harmful replies, while content access is preserved in case we get it wrong. These replies remain accessible in the thread for those who want to see them. We also made sure this feature is aware of who you follow: replies from accounts you follow appear above the fold, while toxic replies from people you don't follow require an additional click to view. After implementing this detection, daily reports of anti-social behavior dropped by approximately 79%. This reduction demonstrates measurable improvement in user experience: people are encountering substantially less toxicity in their day-to-day interactions on Bluesky.
my guess is due to this initiative by the bluesky team.
bsky.social/about/blog/0...
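The reply-ranking rule the screenshot describes is simple to sketch. This is a hypothetical illustration (the `Reply` type and function names are mine, not Bluesky's code), assuming a classifier has already flagged each reply:

```python
# Sketch of the rule: replies from accounts you follow stay above the
# fold; toxic replies from accounts you don't follow go behind an
# extra click, but remain accessible in case the classifier is wrong.
from dataclasses import dataclass

@dataclass
class Reply:
    author: str
    toxic: bool  # output of whatever classifier flags the reply

def partition_replies(replies, following):
    above_fold, behind_click = [], []
    for r in replies:
        if r.toxic and r.author not in following:
            behind_click.append(r)  # hidden by default, one click away
        else:
            above_fold.append(r)    # follows are exempt from demotion
    return above_fold, behind_click
```

The interesting design choice is that nothing is deleted; visibility is the only lever, which is what makes the 79% drop in reports notable.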
The actual issue to solve, at least for large projects, is getting DOS'd by a flood of low quality slop patches.
It's a similar problem to the old Hacktoberfest spam issues, but perhaps worse in scale and scope.
I also like to think of forking as something like a function call's stack allocation: its memory/"context" gets dumped after the work is done and substituted with a return value, which for agents would be a summary of sorts.
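The analogy can be made concrete with a toy sketch (nothing here is a real agent API; `run_agent` stands in for an actual LLM call):

```python
# fork() gives a sub-agent its own copy of the context (the "stack
# frame"), runs the task there, and lets only a summary (the "return
# value") survive back into the parent's context.
def fork(parent_context, task, run_agent):
    # new frame: the sub-agent sees the parent's context plus its task
    child_context = parent_context + [f"TASK: {task}"]
    transcript = run_agent(child_context)  # verbose working happens here
    summary = run_agent([transcript, "Summarise the above in one line."])
    # child_context and transcript are discarded here (frame popped);
    # only the summary lands in the parent's context
    return parent_context + [f"DONE '{task}': {summary}"]
```

With a stub like `run_agent = lambda ctx: "ok"`, `fork(["system prompt"], "refactor", run_agent)` returns a two-item context, no matter how long the sub-agent's transcript was.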
nice to see someone else who is fork-pilled.
@mariozechner.at's pi coding agent handles this pattern very well (still manual, like what you described with your claude code workflow, but with much smoother UX and first class support)
I've already had some aggressive muting and "not interested" preference spamming in place, and it isn't working quite as well anymore. It's just enough friction to make me try newer platforms for the time being!
clawdbot star history showing hockey stick growth, inflection point on Jan 20-ish
do you have any idea what caused the inflection point?
Some people in some normie-ish group chats I'm in thought this was a product by Anthropic. I bet Anthropic has already got support requests for it. Perhaps they don't want their name attached to this; it's horribly insecure for people who don't know what they are doing.
things I have dumped on the internet this month.
How the lobsters algorithm works:
atharvaraykar.com/lobsters/
11:59 PM:
atharvaraykar.com/reinforce/
fwiw, I'm trying to use this site (and substack) more ever since the new X algorithm completely trashed my feed, it's surfacing only toxic sludge and slop
the vibe here has improved quite a bit in the meantime.
but I can also imagine a timeline where I stop microblogging altogether and touch grass
Exploring the weirdness of this would fall under the goals of AI village. At least as I understand it. But they definitely should not be unleashing these agents "outside the lab", hence my mention of this needing to be opt-in/consented or sandboxed in some way.
yeah they messed up with today's goal, these kinds of things need to be opt-in.
The "excellence" still depends on whether the language is in the training distribution. It's pretty competent at Python and JS. Less so in Clojure. Or HashiCorp Configuration Language.
Even so I agree that it's still a productivity boost across most languages.
SWE-bench Verified and SWE-bench Pro

What it measures: how well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests for that issue.

The specifics: there are many variants: Full, Verified, Lite, Bash-only, Multimodal. Most labs report on SWE-bench Verified in their charts, which is a cleaned and human-reviewed subset.

Notes and quirks of SWE-bench Verified:
- It has 500 problems, all in Python. Over 40% are issues from the Django source repository; the rest are libraries. Web applications are entirely missing.
- The repositories the agents have to operate on are real, hefty open source projects.
- Solutions to these issues are small: think surgical edits or small function additions. The mean solution is 11 lines of code; the median is 4. Amazon found that over 77.6% of the solutions touch only one function.
- All the issues are from 2023 and earlier. This data was almost certainly in the training sets, so it's hard to tell how much of the improvements are due to memorisation.
I wrote a post looking into multiple SWE/coding benchmarks. Many of them measure something narrower than what their names suggest.
blog.nilenso.com/blog/2025/09...
Wrote about units of work being a useful lever for getting good results from AI-assisted coding.
blog.nilenso.com/blog/2025/09...