I am working on a food tracker for my very specific needs, and Claude just turned into every developer I ever worked with.
"The hard part is scalability, not automation." That line from session one of "AI evals and analytics" confused me. Session two explained it.Β
Full write-up on my blog: https://kato-coaching.com/the-ai-evals-field-chose-a-flawed-tool-and-stuck-with-it/
#AIEvals #SoftwareTesting #QA
Updated my website this week: it should finally be clear what I do and how to work with me. Courses, workshops, and 1:1 coaching for QA professionals.
kato-coaching.com
If the output is slop regardless of how you phrase it, the problem isn't the prompt. It's the use case.
Free scorecard:
Before I wrote today's post, I defined what good looked like: gives value, doesn't rely on outrage, sounds like me. Writing to a clear brief changes the experience. So does diagnosing a draft. Testers do this before they run anything.
The AI correction loop usually starts before the tool is opened, at the moment someone chose the wrong use case for it. Testers already know how to ask whether a tool suits a problem. That skill just hasn't been applied here yet. More soon.
The skills QA professionals already have (defining success criteria, testing behaviour, not trusting metrics at face value) are exactly what's missing from most AI integrations. I'm learning AI evals to understand why.
Session one: kato-coaching.com/what-i-dont-understand-about-ai-evals-yet/
Ran my workshop "Deciding Fast" on Tuesday with a software team in Sweden. Everyone in one room sharing computers, no breakouts. It's built for remote, so I adapted. Six of eight rated it Good or Excellent. Best response to "what will you do differently?": "Set clearer success condition."
#SoftwareTesting #AITesting
"Tell me what you're uncertain about." "Push back if my assumptions are wrong." Only 30% of people give instructions like these. The model defaults to confident and agreeable.Β
https://www.anthropic.com/research/AI-fluency-index
#SoftwareTesting #AITesting #AILiteracy
The strongest predictor of AI fluency, per Anthropic's research: iteration. Treating the first response as a draft, not an answer. People who iterate are 5.6x more likely to question the model's reasoning. Familiar territory if you work in testing.
https://www.anthropic.com/research/AI-fluency-index
#SoftwareTesting #AITesting
Sounds really interesting! I hope you can share it outside of that conference presentation; I'd love to hear more when you have it.
That's a great approach. I presume you don't want to spoil your punchline by telling us how it's going?
Six months in and nobody can say whether the AI is actually working. Not anecdotally, but with evidence. That gap is the most common thing I see in QA teams right now.
How do you measure if the new licence is worth the money?
#softwaretesting #QA #AItools
18% of testers I surveyed said their top AI frustration is not bad output. It is that the tools have no sense of test strategy.
The AI is not wrong. It is indiscriminate.
#AITesting #SoftwareTesting #TestStrategy
I timed myself running tests manually, then gave the same task to an AI tool and timed that too.
The result was not what I expected.
Full breakdown with video later this week.
You ask the AI to generate a few test cases for a feature you know inside out. Save yourself twenty minutes.
What comes back looks reasonable. Then you start reading.
Read the full story on my blog:
Why do so many testers call AI "exhausting"?
It is not the learning curve. It is the correction loop. You spend 45 minutes fixing output that was supposed to save you an hour.
Five experiments to figure out which tasks AI actually improves:
I asked testers what worried them most about AI. 24% said management expectations, not the tools.
"Pressure to deliver has increased dramatically because management thinks we must be twice as productive now."
Argue with evidence:
I surveyed 17 testers about their biggest AI frustrations. 65% said the same thing: the output is unreliable.
One called it "slop." Another described the correction loop as exhausting.
I wrote a free guide: 5 small experiments
I have blogged about why I think the rise of AI tooling is not the end of the world that many in the testing community fear.
AI can generate 500 test cases in an hour.
But if you don't know what decision you're supporting, you're just generating noise faster.
I'm updating the QED course on this. Quick survey if you've worked with AI testing tools:
AI tools won't fix your testing problems if you're solving the wrong problems.
The QED framework: start with what's breaking, who feels it, and what it costs. Then run a two-week experiment.
Works whether you're writing tests by hand or using AI.
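To make that concrete, here is a minimal sketch of what a two-week experiment record could look like. The field names and example values are my own illustration, not part of the QED framework as published:

```python
# Hypothetical sketch of a QED-style two-week experiment record.
# Field names and example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Experiment:
    whats_breaking: str   # the problem, in concrete terms
    who_feels_it: str     # the role or team paying for it
    what_it_costs: str    # time, escaped bugs, missed releases
    change: str           # the one thing we try for two weeks
    success_signal: str   # how we will know it helped

pilot = Experiment(
    whats_breaking="regression suite takes two days to run",
    who_feels_it="release manager and on-call devs",
    what_it_costs="release slips roughly one day per sprint",
    change="AI-generated smoke subset, reviewed by a senior tester",
    success_signal="regression feedback in under four hours",
)
```

Writing it down like this forces the team to name a success signal before the experiment starts, not after.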
AI generates test cases fast. Most of them are useless.
Here's how to filter before you generate:
The AI review debate feels similar to early reactions to spellcheck.
Tools that scan for specific issues at scale can raise the baseline, as long as the rules are clear and judgement stays with the human.
As a German living abroad, with many friends all over the world and specifically in the U.S., this summarises well how I feel about the current situation.
AI did not break testing. We were already stuck.
Watching QA argue about titles while AI becomes the new panic topic feels familiar.
We already know how to deal with uncertainty and risk. We just keep choosing not to apply it.
The "AI reviewing AI is marking its own homework" argument is based on an incorrect analogy.
Most poor AI output starts with vague prompts. Reviewing with clear, narrow criteria can be useful, because AI is good at scanning volume when the rules are explicit.
Do you let AI review AI?
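As one possible illustration of what "clear, narrow criteria" could look like in practice (the three rules and the prompt wording below are my own assumptions, not a fixed standard):

```python
# Hypothetical sketch: give a reviewing model explicit, narrow rules
# instead of an open-ended "review these tests" instruction.
# The three criteria here are illustrative assumptions.
REVIEW_CRITERIA = """Check each test case against these rules ONLY:
1. It names the behaviour under test, not the implementation.
2. Its expected result is concrete, never "verify it works".
3. It is not a near-duplicate of another case in this batch.
Flag violations with the rule number. Do not rewrite or add tests."""

def build_review_prompt(test_cases: str) -> str:
    # Judgement stays with the human: the model only flags, a person decides.
    return f"{REVIEW_CRITERIA}\n\nTest cases:\n{test_cases}"
```

The point is the shape of the instruction: explicit rules, a bounded job, and the decision left to a person.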
The problem with many AI-written tests is not that they are "wrong".
It's that they are not all that useful.
When teams are told to "just use AI", they often skip the thinking step that normally shapes good test design.
I rage-quit a project after an AI "simple UI reskin" rewrote half my system.
The issue wasn't the model. It was vibe coding without constraints. Fast output, eroded structure, invisible scope creep.
But I didn't ditch the AI.
AI can generate test cases fast, which is why judgement matters more, not less.
If you can't name the decision, the uncertainty, or the constraints, AI will give you volume instead of confidence.
How do you tell activity from evidence in your team?