
atharva

@atharvaraykar.com

i write at atharvaraykar.com i work @nilenso.com yes-anding the world.

106
Followers
197
Following
78
Posts
17.11.2024
Joined

Latest posts by atharva @atharvaraykar.com

Hacker-types who build random software side projects for fun, without needing to justify their utility or ability to generate capital, are one of the oldest programmer stereotypes.

Many such builders still exist!

16.02.2026 05:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I've been calling it Lee Sedol'd but Deep Blue is perhaps a better term for it. I thought the AlphaGo documentary was perhaps the greatest depiction of this "Deep Blue" feeling, especially knowing where we are now.

16.02.2026 04:28 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Codex CLI vs Claude Code on autonomy

My colleague analysed the system prompts for Codex and Claude and realised that they feel different because of deliberate product decisions in the prompt!

blog.nilenso.com/blog/2026/02...

14.02.2026 04:27 πŸ‘ 6 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

It's also OpenAI-led and the reference schema is more-or-less identical to the current Responses API. It doesn't seem like Anthropic or Google have bought into thisβ€”they have competing formats.

The vendors that are bought in don't have fully compliant implementations yet.

12.02.2026 11:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Open Responses Open Responses documentation overview.

I was hoping the OpenResponses API would be a meaningful step toward dealing with the LLM API standardisation headaches, but right now the spec is really undercooked.

There are lots of inconsistencies/contradictions between the reference schemas and what the specification says!

www.openresponses.org

12.02.2026 11:01 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Overview - A2A Protocol The official documentation for the Agent2Agent (A2A) protocol. The A2A protocol is an open standard that allows different AI agents to securely communicate, collaborate, and solve complex problems tog...

The problem it is solving makes sense (cross-vendor agent communication), but I don't understand why there's such a massive and detailed spec for an *anticipated* use case that hasn't properly materialised yet.

Castles in the sky energy.

a2a-protocol.org/latest/speci...

11.02.2026 10:48 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
A2A Protocol The official documentation for the Agent2Agent (A2A) protocol. The A2A protocol is an open standard that allows different AI agents to securely communicate, collaborate, and solve complex problems tog...

Is A2A protocol completely useless? I don't know of anyone building enterprise multi-agent communication.

Why design such a thick protocol for a use case that does not yet exist in practice?

a2a-protocol.org/latest/

Feels like another SOAP/CORBA, etc.

11.02.2026 10:41 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

ese
bsky.app/profile/grac...

09.02.2026 03:29 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Ese Large language models are better thinkers than writers. Well okay, they don't think as humans do, but I've been letting it write the vast chunk of my computer programs over the last year or so, which ...

thoughts on ese.
atharvaraykar.com/ese/

08.02.2026 13:13 πŸ‘ 3 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Quantifying infrastructure noise in agentic coding evals Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

The other reason is that infrastructure noise and variations can affect benchmarks a lot, I wonder if that's the case with the SWE Bench Pro runs.

www.anthropic.com/engineering/...

06.02.2026 14:43 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

The harness matters a lot. SWE Bench Pro uses SWE-Agent-Mini by default. OpenAI likely reports a result on their own harness.

Codex is a weird model that performs much worse in generic, minimal harnesses. That's likely why the tool shapes in Codex CLI are strange, like "apply_patch".

06.02.2026 14:43 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

While SWE Bench Pro is a pretty good benchmark (especially compared to Verified), the leaderboard rankings clearly "look wrong". No one will agree that Claude 4 Sonnet is better than 5.2 Codex.

Really shows how insufficient public benchmarks are becoming at conveying model capabilities.

06.02.2026 07:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
SWE-Bench Pro (Public Dataset) Explore the SEAL leaderboard with expert-driven LLM benchmarks and updated AI model leaderboards, ranking top models across coding, reasoning and more.

It's particularly strange that Anthropic won't report SWE-Bench Pro in their announcements. Their models have always done better on it than OpenAI's (at least on the public dataset):
scale.com/leaderboard/...

I think it might just be that ~80% solved looks more impressive than ~50% solved.

06.02.2026 07:06 πŸ‘ 1 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

the screenshot is from my post: blog.nilenso.com/blog/2025/09...

06.02.2026 06:59 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0


    Our tasks typically use environments that do not significantly change unless directly acted upon by the agent. In contrast, real tasks often occur in the context of a changing environment.

    […]

    Similarly, very few of our tasks are punishing of single mistakes. This is in part to reduce the expected cost of collecting human baselines.

This is not at all like the tasks I am doing.

METR acknowledges the messiness of the real world. They have come up with a β€œmessiness rating” for their tasks, and the β€œmean messiness” of their tasks is 3.2/16.

By METR’s definitions, the kind of software engineering work that I’m mostly exposed to would score at least around 7-8, given that software engineering projects are path-dependent, dynamic and without clear counterfactuals. I have worked on problems that get to around 13/16 levels of messiness.

    An increase in task messiness by 1 point reduces mean success rates by roughly 8.1%

Extrapolating from METR’s measured effect of messiness, GPT-5 would go from 70% to around 40% success rate for 2-hour tasks. This maps to my experienced reality.
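That back-of-envelope can be sanity-checked under either reading of the 8.1% figure (absolute percentage points vs. a multiplicative drop). A hypothetical sketch, where the 7.5 messiness value is my own assumption (the midpoint of the 7-8 range above), not a METR number:

```python
# Rough extrapolation from METR's reported messiness effect.
# All inputs except drop_per_point are assumptions for illustration.
base_success = 0.70         # GPT-5 on ~2-hour tasks at METR's mean messiness
base_messiness = 3.2        # METR's reported mean messiness rating
my_messiness = 7.5          # assumed midpoint of the 7-8 range above
drop_per_point = 0.081      # METR's reported effect per messiness point

delta = my_messiness - base_messiness   # 4.3 points messier

# Two readings of "reduces success rates by roughly 8.1%":
absolute = base_success - delta * drop_per_point          # percentage points
relative = base_success * (1 - drop_per_point) ** delta   # compounding

print(f"absolute: {absolute:.2f}, relative: {relative:.2f}")
```

Both readings land in roughly the 35-49% band, consistent with the "around 40%" figure.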


The METR tasks are narrow (i.e., "not messy") and not very numerous, so it's hard to generalise the automatability of software engineering from that alone.

It looks like 100% replacement for software engineering can happen, but perhaps not for at least the next 2 years.

06.02.2026 06:58 πŸ‘ 5 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Taking Jaggedness Seriously Why we should expect AI capabilities to keep being extremely uneven, and why that matters

"Taking Jaggedness Seriously" by Helen Toner talks about this in some depth.

helentoner.substack.com/p/taking-jag...

03.02.2026 14:43 πŸ‘ 6 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
screenshot:

PART I: 2025 KEY INITIATIVES Toxicity Filtering. Toxicity is a persistent challenge for all large-scale social apps. As communities grow, maintaining space for both friendly conversation and fierce disagreement requires intentional design choices. Our community doubled in size over the past year, and with that growth came tension: how to preserve healthy discourse while respecting genuine debate and diverse user preferences. Toxic and inflammatory discourse appears across all forms of social media; and almost universally, it's the case that a small percentage of people contribute disproportionately to causing this problem. A tiny number of users can have an outsize impact on conversation quality and on people's willingness to participate. In 2023-2024, anti-social behavior, such as harassment, trolling, and intolerance, consistently ranked among our top complaints reported by users. This content drives people away from forming connections, posting, or engaging, for fear of attacks and pile-ons.


screenshot:

In October, we began experimenting with improving conversation quality, starting with replies. Rather than only reacting after users report abusive or toxic interactions, we launched an experiment to identify replies that are toxic, spammy, off-topic, or posted in bad faith, and reduce their visibility in the Bluesky app. This approach adds friction: most viewers casually scanning a conversation won't encounter the toxic or potentially harmful replies, while preserving content access in case we get it wrong. These replies remain accessible in the thread for those who want to see them. We also made sure this feature is aware of who you follow: Replies from accounts you follow appear above the fold, while toxic replies from people you don't follow require an additional click to view. After implementing this detection, daily reports of anti-social behavior dropped by approximately 79%. This reduction demonstrates measurable improvement in user experience: People are encountering substantially less toxicity in their day-to-day interactions on Bluesky.


my guess is it's due to this initiative by the bluesky team.

bsky.social/about/blog/0...

03.02.2026 04:34 πŸ‘ 26 πŸ” 2 πŸ’¬ 2 πŸ“Œ 1

The actual issue to solve, at least for large projects, is getting DoS'd by a flood of low-quality slop patches.

It's a similar problem to the old Hacktoberfest spam issues, but perhaps worse in scale and scope.

30.01.2026 13:20 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I also like to think of forking as being like a function call's stack allocation: its memory/"context" gets dumped out after the work is done and substituted with a return value, which for agents would be a summary of sorts.
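A hypothetical sketch of that analogy; none of these names correspond to a real agent API:

```python
# A fork as a stack frame: the child's working context is freed on "return",
# and only a summary (the return value) survives into the parent.
def fork_and_summarize(parent_context, task, run_agent, summarize):
    child_context = list(parent_context)    # "push a frame": copy the context
    child_context.append(task)
    transcript = run_agent(child_context)   # child burns its own context freely
    return summarize(transcript)            # "pop": only the return value escapes

parent = ["system: you are a coding agent"]
result = fork_and_summarize(
    parent,
    "task: refactor the parser",
    run_agent=lambda ctx: ctx + ["...many tool calls and file dumps..."],
    summarize=lambda transcript: "summary: parser refactored, tests pass",
)
parent.append(result)   # parent grows by one summary line, not the whole transcript
```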

29.01.2026 17:43 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

nice to see someone else who is fork-pilled.

@mariozechner.at's pi coding agent handles this pattern very well (still manual, like what you described with your claude code workflow, but with much smoother UX and first class support)

29.01.2026 17:43 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I already have some aggressive muting and "not interested" preference spamming in place, and it isn't working quite as well anymore. It's just enough friction to make me try newer platforms for the time being!

28.01.2026 03:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
clawdbot star history showing hockey stick growth, inflection point on Jan 20-ish


do you have any idea what caused the inflection point?

27.01.2026 16:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

Some people in some normie-ish group chats I'm in thought this was a product by Anthropic. Bet they'd have got support requests for this already. Perhaps they don't want to have their name attached to this, which is horribly insecure for people who don't know what they are doing.

27.01.2026 16:48 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
How the Lobsters front page works Lobsters is a computing-focused community centered around link aggregation and discussion. The code is open source, so I had a look at how the front page algorithm works. This is it: $$\textbf{hotn...

things I have dumped on the internet this month.

How the lobsters algorithm works:
atharvaraykar.com/lobsters/

11:59 PM:
atharvaraykar.com/reinforce/

27.01.2026 16:26 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

fwiw, I'm trying to use this site (and substack) more ever since the new X algorithm completely trashed my feed; it's surfacing only toxic sludge and slop

the vibe here has improved quite a bit in the meantime.

but I can also imagine a timeline where I stop microblogging altogether and touch grass

27.01.2026 16:24 πŸ‘ 5 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Exploring the weirdness of this would fall under the goals of AI village. At least as I understand it. But they definitely should not be unleashing these agents "outside the lab", hence my mention of this needing to be opt-in/consented or sandboxed in some way.

27.12.2025 06:36 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

yeah they messed up with today's goal; these kinds of things need to be opt-in.

26.12.2025 13:14 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

The "excellence" still depends on whether the language is in the training distribution. It's pretty competent at Python and JS. Less so in Clojure. Or HashiCorp Configuration Language (HCL).

Even so, I agree that it's still a productivity boost across most languages.

28.11.2025 07:10 πŸ‘ 5 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
SWE-bench Verified and SWE-bench Pro
What it measures

How well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests for that issue.
The specifics

There are many variants: Full, Verified, Lite, Bash-only, Multimodal. Most labs report SWE-bench Verified in their charts; it is a cleaned, human-reviewed subset.

Notes and quirks of SWE-bench Verified:

    It has 500 problems, all in Python. Over 40% are issues from the Django source repository; the rest are libraries. Web applications are entirely missing. The repositories that the agents have to operate in are real, hefty open source projects.
    Solutions to these issues are small: think surgical edits or small function additions. The mean solution is 11 lines of code; the median is 4. Amazon found that over 77.6% of the solutions touch only one function.
    All the issues are from 2023 and earlier. This data was almost certainly in the training sets. Thus it’s hard to tell how much of the improvements are due to memorisation.


I wrote a post looking into multiple SWE/coding benchmarks. Many of them measure something narrower than what their names suggest.

blog.nilenso.com/blog/2025/09...

29.09.2025 07:16 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

Wrote about units of work being a useful lever for getting good results from AI-assisted coding.

blog.nilenso.com/blog/2025/09...

19.09.2025 09:21 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0