More of my thoughts on this topic here: https://beyondruntime.substack.com/p/spec-driven-development-the-rebranded
If your team isn't discussing Spec-Driven Development or how your design decisions are documented, versioned, and shared, you're undermining your AI tooling strategy.
AI agents (especially when running in parallel) don't operate well on vague intent. They need precise, well-reasoned specifications to make consistent decisions.
Most engineers tune out when they hear "system design."
It feels like overhead.
Until AI forces you to care about it.
If your data is scattered across tools, aggressively sampled, and missing payloads… AI can't magically correlate it for you.
You're automating a broken workflow.
Full article: https://beyondruntime.substack.com/p/a-major-incident-will-be-traced-back
Why is this worrisome? Because the latest State of Code Developer Survey from Sonar reports that the share of code that is AI-generated or significantly AI-assisted will jump to 65% by 2027.
This effectively means that very soon (if not already) a major incident will be traced back to an AI coding tool.
So the net effect of AI-assisted development is that we've offloaded the part developers are generally comfortable with (writing code), and left them with the part that's harder (system design, reviews, debugging, etc.), but without the context built naturally by doing the writing themselves.
Reading and understanding someone elseβs code is significantly harder than writing code yourself. AI-generated code is, in every meaningful sense, someone elseβs code.
Adding AI to legacy observability practices won't make debugging faster.
It'll just amplify the problem.
The talk covers the modern telemetry data problem, why most MCP implementations inherit broken observability practices, and the path to self-healing systems that can actually act on the right data.
Full agenda: leaddev.com/leaddev-lond...
Built an MCP server recently? Did developers actually use it, or is it collecting dust?
I'll be speaking at @leaddev.com (June 1-2) about why "connecting AI to everything" doesn't work and what actually does when building tools that move from assistants to autonomous agents.
See you there?
IMO, the missing piece of the puzzle is having runtime visibility into your system, auto-correlated across the stack.
Reading unfamiliar code is exhausting. Now imagine that code is coming from an LLM that writes faster than you can think and doesn't take lunch breaks.
The AI productivity paradox: code is written faster than ever, but humans can't keep up with manually reviewing it (or debugging it)!
You need the right runtime context to not fly blind.
More about this here
beyondruntime.substack.com/p/the-ai-gua...
You can't reproduce non-deterministic behavior. You need the actual context from when it happened: the prompts, the reasoning, the state of the system, what external services returned.
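A minimal sketch of what "capturing the actual context" could look like: a structured decision record written at the moment the agent acts. All field names and the `record_decision` helper are illustrative, not a real library's API.

```python
import json
import time

def record_decision(prompt, reasoning, system_state, external_response, decision):
    # Persist everything needed to reconstruct a moment you can't replay:
    # the prompt, the model's stated reasoning, system state, and what
    # external services actually returned at that instant.
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "reasoning": reasoning,
        "system_state": system_state,
        "external_response": external_response,
        "decision": decision,
    }
    # In practice you'd ship this to your observability pipeline,
    # not just serialize it.
    return json.dumps(record)

log_line = record_decision(
    prompt="Refund this order?",
    reasoning="Order is within the 30-day window",
    system_state={"order_id": "o-123", "status": "delivered"},
    external_response={"payments_api": "refundable"},
    decision="issue_refund",
)
```

The point of the design: because the agent's behavior can't be reproduced, the record has to be complete at write time; anything you don't capture now is gone.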
The guardrails problem is both about safety AND observability.
When an LLM writes code for you, the output is deterministic: you can read it, test it, fix it (even if it's harder than writing it yourself).
But when an AI agent makes decisions in prod (e.g. which API to call, how to respond to a user, etc.) traditional debugging breaks down.
Debugging AI-generated code is annoying (and time consuming).
Debugging AI-as-a-runtime-component is a completely different problem.
Most teams lose hundreds of engineering hours per month to correlation tax. Time that could be spent shipping features instead of hunting for information.
Read the full breakdown in my latest post: dzone.com/articles/aut...
(1) Track your last 5 incidents. How much time did you spend finding the data vs writing the fix?
(2) Which types of bugs required the most time to hunt for data?
(3) Is your observability stack solving this or just shifting it around?
So engineers spend hours playing detective: copying request IDs between tools, matching timestamps, manually piecing together what happened.
Here's how to determine whether your team has a correlation problem:
Teams that fix the data correlation problem ship faster and debug smarter. But what exactly is it?
It's when debugging (understanding what went wrong) takes longer than the fix itself, because the information lives in a bunch of different places: Sentry, Stripe, LogRocket, and several APM tools.
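As a toy illustration (the event shapes and values here are made up), the "detective work" boils down to a manual join on a shared request ID across disconnected data sources:

```python
# Hypothetical events exported from two separate tools.
frontend_errors = [
    {"request_id": "req-42", "ts": "12:00:01", "error": "checkout spinner hung"},
]
backend_logs = [
    {"request_id": "req-41", "ts": "11:59:58", "msg": "payment ok"},
    {"request_id": "req-42", "ts": "12:00:00", "msg": "payment API timeout"},
]

# The manual correlation step engineers do by hand: index one source
# by request_id, then look up each event from the other source.
by_request = {log["request_id"]: log for log in backend_logs}
for err in frontend_errors:
    backend = by_request.get(err["request_id"])
    cause = backend["msg"] if backend else "no matching backend event"
    print(f"{err['error']} <- {cause}")
```

When the tools don't share IDs at all, even this join is impossible and you fall back to matching timestamps, which is where the hours go.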
Traditional software: `Input A → Output B` (always, reliably, debuggably)
AI-generated software: `Input A → Output B... or B', or B'', or sometimes C`
That's why debugging AI-generated code is tricky.
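A toy sketch of why reproduction fails: with sampling temperature above zero, the same input can yield a different output on every call. The model is simulated here with `random.choice`; `flaky_model` is a hypothetical stand-in, not a real API.

```python
import random

def flaky_model(prompt: str) -> str:
    # Stand-in for an LLM sampled with temperature > 0: the same
    # prompt can produce different completions on each call.
    return random.choice(["B", "B'", "B''", "C"])

# Same input, run many times: the set of distinct outputs
# will almost certainly contain more than one element.
outputs = {flaky_model("Input A") for _ in range(100)}
print(outputs)
```

Rerunning the failing input is no longer a reliable way to observe the failure, which is exactly the `Input A → Output B... or B'` problem above.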
Five screenshots. Three 'can you also send...' messages. Twenty minutes later, you finally have enough context to start debugging.
Have you ever had (or witnessed) a similar conversation in Slack?
Agreed. AI excels at pattern recognition (code review, static analysis) and iterative debugging when it has the *right context*.
That's the key: what you feed them. AI debugging works when you give it complete context: user session replays, full traces, request/response data, runtime behavior.
AI debugging is great... until you hit the context bottleneck. In particular, you need runtime data: what the user did, what the backend processed, what came back, etc.
Without that, you're indeed getting vibes, not diagnosis or solutions.
The five blind spots:
(1) Runtime visibility
(2) Hallucinations: Compiles ≠ works correctly
(3) Narrow debugging context
(4) Performance: No awareness of memory, concurrency, scaling bottlenecks
(5) Architecture: Missing organizational context, budget constraints, Conway's Law
AI tools are exceptional at the 99.2%. They struggle with the 0.8% that actually matters.
I just published a breakdown of where AI systematically fails.
Marc Donner: "Of all my programming bugs, 80% are syntax errors. Of the remaining 20%, 80% are trivial logical errors. Of the remaining 4%, 80% are pointer errors. And the remaining 0.8% are hard."
This sentiment is surprisingly relevant for AI tools.