nilenso

@nilenso.com

Employee-owned programmer cooperative in Bangalore.

881
Followers
101
Following
2
Posts
18.11.2024
Joined

Latest posts by nilenso @nilenso.com

SWE-bench Verified and SWE-bench Pro
What it measures

How well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests for that issue.
The specifics

There are many variants: Full, Verified, Lite, Bash-only, Multimodal. Most labs report SWE-bench Verified in their charts; it is a cleaned, human-reviewed subset.

Notes and quirks of SWE-bench Verified:

    It has 500 problems, all in Python. Over 40% are issues from the Django source repository; the rest are from libraries. Web applications are entirely missing. The repositories that the agents have to operate in are real, hefty open source projects.
    Solutions to these issues are small: think surgical edits or small function additions. The mean solution is 11 lines of code and the median is 4. Amazon found that over 77.6% of the solutions touch only one function.
    All the issues are from 2023 and earlier. This data was almost certainly in the training sets, so it’s hard to tell how much of the improvement is due to memorisation.
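
To make the metric concrete, here is a minimal sketch of how a resolve rate could be computed, assuming each problem simply records whether the issue's unit tests passed after the agent's patch was applied. The `Result` type, the instance IDs and the outcomes are hypothetical illustrations, not SWE-bench's actual harness or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    instance_id: str    # hypothetical problem identifier
    tests_passed: bool  # True if the issue's unit tests pass after the patch


def resolve_rate(results: list[Result]) -> float:
    """Fraction of problems whose tests pass after applying the agent's patch."""
    if not results:
        return 0.0
    return sum(r.tests_passed for r in results) / len(results)


# Made-up outcomes for illustration only.
results = [
    Result("example-1", True),
    Result("example-2", False),
    Result("example-3", True),
    Result("example-4", True),
]
print(f"{resolve_rate(results):.0%}")
```

A single pass/fail bit per problem says nothing about patch size or whether the fix was memorised, which is why the quirks listed above matter when reading the headline number.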

I wrote a post looking into multiple SWE/coding benchmarks. Many of them measure something narrower than what their names suggest.

blog.nilenso.com/blog/2025/09...

29.09.2025 07:16 👍 1 🔁 1 💬 0 📌 0
Wrote about units of work being a useful lever for getting good results from AI-assisted coding.

blog.nilenso.com/blog/2025/09...

19.09.2025 09:21 👍 1 🔁 1 💬 0 📌 0
My Quarterly System Health Check-in

It is essential to periodically take a few steps back from the day to day and reflect on where we are against our strategic goals. If you’re an engineering leader, a head of engineering, a director, or a VP, you likely have a recurring meeting to this effect.

In this post, I propose a structure for this operational exercise (complementing a business review) that lasts 2-4 hours, every month or quarter. I see quality as solving for the Pareto front with the tangible dimensions of reliability, performance, cost, delivery and security, and the more intangible dimensions of simplicity and social structures. For each dimension, go through the list of questions below and try to answer them together.

I've been poking Srihari, our most experienced engineer @nilenso.com, to share his hard-earned knowledge for the benefit of others.

Even if you're not an engineering leader like me, this checklist gives a lot of insight into what makes a great engineering org.

blog.nilenso.com/blog/2025/09...

12.09.2025 06:32 👍 2 🔁 2 💬 0 📌 0
While Kuhn doesn’t go into it, the technological diffusion and economic impact of scientific revolutions are better studied through the GPT (general purpose technology) paper by Bresnahan & Trajtenberg (1995).

> “General Purpose Technologies (GPTs) are technologies that can affect an entire economy (usually at a national or global level). They have the potential for pervasive use in a wide range of sectors and, as they improve, they contribute to overall productivity growth.”

And Calvino et al. (June 2025) find that AI meets the key criteria of a General Purpose Technology. It’s pervasive, rapidly improving, enables new products, services and research methodologies, and enhances other sectors’ R&D and productivity.

Why Does AI Feel So Different?

An enjoyable read from my colleague, Srihari.

We've been talking about why this disruption feels different from other recent technological disruptions and he captured a lot of that really well in this post.

Link: blog.nilenso.com/blog/2025/08...

14.08.2025 07:15 👍 0 🔁 1 💬 1 📌 0
Table of Contents

    General guidelines
    Starting Points
    Official announcements, blogs and papers from those building AI
    High signal people to follow
    News and Media
    Esoterica
    Do I chug water from a firehose?

Several people ask me about how I'm keeping up with all the AI things and finding signal in this noisy landscape. I wrote a guide explaining this.

blog.nilenso.com/blog/2025/06...

04.07.2025 09:00 👍 4 🔁 1 💬 0 📌 0
AI-assisted coding for teams that can't get away with vibes - nilenso blog ...

AI-assisted coding for teams that can't get away with vibes

blog.nilenso.com/blog/2025/05...

10.06.2025 22:12 👍 112 🔁 18 💬 5 📌 5
i quickly concocted a writer's block unblocker
(with @tldraw.com computer)

it takes an oblique strategy (from the brian eno et al card deck) and uses it to provide unhinged critique of the essay you are working on to help you break out of a rut

link to program: computer.tldraw.com/t/4KoB33nFEr...

29.12.2024 17:23 👍 9 🔁 2 💬 1 📌 0
Acknowledge that all metrics are approximations, and that they will fall short (as numbers always do) when they try to represent reality. Also ask yourself if a more precise measurement will really make a significant difference to outcomes. If not, move on. You can always revisit the metrics as your product and consumers evolve, but frequent churn in definitions, and parsing increasing amounts of data in ever more complex ways, will invariably cost you more than it’s worth.

Wise product managers know that "good enough data" is better than "perfect data".

Focus on purpose, not perfection, writes Deepa on the nilenso blog.

blog.nilenso.com/blog/2024/12...

27.12.2024 08:56 👍 4 🔁 0 💬 0 📌 0
Huh? A software cooperative? - nilenso blog Steven Deobald ...

🦋

Huh? A software cooperative?

blog.nilenso.com/blog/2014/11...

18.11.2024 09:57 👍 3 🔁 0 💬 0 📌 0