Kara (@karashiiro.moe) — bluesky.baby

"Can you believe this asshole, he didn't want to support any Claudes so he slapped a ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86 in the spec, now all the package managers need to strip that first"

06.03.2026 16:04 👍 0 🔁 0 💬 1 📌 0

Model variance isn't a determinism issue but it is a reliability one, anyhow

06.03.2026 16:01 👍 0 🔁 0 💬 0 📌 0

Oh and also models

"Hey can you fix your prompt? This doesn't compile on Opus 4.6"

"Oh sorry this app only supports Gemini 3.1 Pro"

06.03.2026 15:59 👍 0 🔁 0 💬 1 📌 1

Yeah, agent output quality is just so so dependent on workflows in general, like I'm sure it's possible to contrive a case where we can control for everything but just realistically I think it'll take so many constraints that it won't be worth it

06.03.2026 15:57 👍 1 🔁 0 💬 0 📌 0

They in fact perform much better with information retrieval tools to ground them, unless given a spec that perfectly describes the system being built and run into no environment issues to debug (oh right, the environment is also nondeterministic)

06.03.2026 15:55 👍 0 🔁 0 💬 1 📌 0

Some people are rightly questioning this downthread (determinism is technically possible, albeit nontrivial and deployment-dependent)

But I would like to add one thing which makes this very obviously true

Coding agents have web search, and you can give them other custom nondeterministic tools, too

06.03.2026 15:53 👍 6 🔁 2 💬 2 📌 0

Defeating Nondeterminism in LLM Inference Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. For example, you might observe that asking ChatGPT the...

batching makes it difficult, specifically

06.03.2026 15:48 👍 0 🔁 0 💬 0 📌 0

did it wind up using the actual SDK at all? the official rust SDK only got a 1.0 release like last week fwiw

06.03.2026 00:49 👍 0 🔁 0 💬 0 📌 0

also portability from the perspective of people writing them I guess

05.03.2026 19:12 👍 2 🔁 0 💬 0 📌 0

I don't really but the GUIs just don't have feature parity yet

05.03.2026 19:11 👍 2 🔁 0 💬 1 📌 0

this is what Copilot does to a service

05.03.2026 19:09 👍 3 🔁 0 💬 0 📌 0

I will say that requires being intentional though, and that's not what the loudest boosters are generally suggesting

05.03.2026 18:22 👍 1 🔁 0 💬 0 📌 0

I think of it like learning a language, if you achieve fluency but don't exercise that skill for years, it does atrophy, but at the same time if you practice it a bit periodically (or a lot rarely), it doesn't take long for it to all come back

It is important to exercise the code-writing brain

05.03.2026 18:18 👍 3 🔁 0 💬 1 📌 0

surely there's an MCP server for this 😔

05.03.2026 18:03 👍 4 🔁 0 💬 0 📌 0

I think this is mostly true if you don't understand the problem space (so particularly for junior developers), but if you do then reviewing code is far faster than writing it

and anyways it's more like 500 lines in 10 seconds

05.03.2026 18:00 👍 2 🔁 0 💬 1 📌 0

It is quite frustrating behavior when in fact the thing in question is real and a better tool would have found it quickly

05.03.2026 17:51 👍 3 🔁 0 💬 1 📌 0

(which goes back to why specs are useful)

05.03.2026 17:26 👍 2 🔁 0 💬 0 📌 0

this is why rewrites are good and important sometimes (but you need to ensure you don't lose the problem structure in the process)

05.03.2026 17:25 👍 2 🔁 0 💬 1 📌 0

this differs from literal code, which has the problem that all new code inherits the problems of the code it was built on without aggressive refactoring, so you cannot in fact treat literal code as a perfect description of the problem structure - it's working around problems it introduced on its own

05.03.2026 17:24 👍 2 🔁 0 💬 1 📌 0

things like OAI Symphony going "here's our spec if you want your own version" are reflections of the same ideas but just rely on someone else having done the legwork first to turn an unseen problem into a codified understanding of the actual problem structure

05.03.2026 17:19 👍 2 🔁 0 💬 1 📌 0

admittedly I do actually do that sometimes, mostly just for fun though since it makes it harder to validate

05.03.2026 16:42 👍 0 🔁 0 💬 0 📌 0

like, almost every time I've gotten poor results it's because I asked for a thing that I didn't initially understand wasn't possible without other compromises, or because I didn't understand it well enough to provide appropriate explanations or documentation about it

05.03.2026 16:40 👍 4 🔁 0 💬 1 📌 0

naturally this means the bulk of the time actually spent on work with AI is making sure you understand the problem well enough to scaffold that, and if it does poorly, most of the time IME it's because you don't understand the problem well-enough

05.03.2026 16:38 👍 4 🔁 0 💬 1 📌 0

going into the office just because our VP will be there today feels like making a pilgrimage to pay respects to the king

05.03.2026 16:35 👍 5 🔁 0 💬 0 📌 0

coding with AI is fun because it is also about the process, the process is just about setting up sources of information about the real problem you want to solve, I'm not particularly a fan of just throwing it at a problem and hoping the matrices just Produce Correctly

it's like dominoes, basically

05.03.2026 16:26 👍 11 🔁 0 💬 2 📌 1

the question of "how rigorous does a spec and a testing regime have to be such that comprehensive adherence to it is as trustable as legacy software you've been running for years" is a pretty wild question that I imagine a lot of people are asking and wish they weren't

05.03.2026 14:51 👍 26 🔁 7 💬 0 📌 0

Color $0D games On an NES, the palette color $0D causes the signal to drop below the normal black level. This low voltage signal is sometimes mistaken by televisions for blanking signals, which can cause an unstable ...

i think its really funny that the NES just has an Evil color you should never use ever

05.03.2026 05:05 👍 19 🔁 5 💬 0 📌 0

A while ago my parents' smart TV added a "generate an AI screensaver" button and every time I visit I add some kind of strange horse to the rotation

05.03.2026 00:49 👍 52 🔁 10 💬 5 📌 1

New post building on this idea, if you are feeling blue as a developer, build your own stuff. It won't fix everything but it can fix some things. vickiboykis.com/2026/03/04/a...

05.03.2026 14:21 👍 72 🔁 15 💬 2 📌 1

🚨 4.1 Opus Committed Deliberate Task Fraud in Production Context (CRITICAL) · Issue #5320 · anthropics/claude-code Severity: CRITICAL - Production Safety / Trust Violation Summary Claude Code is actively deceiving users about task completion in production systems, creating severe safety risks. This is not a bug...

reminds me of this

05.03.2026 08:32 👍 4 🔁 0 💬 1 📌 0

Kara

Latest posts by Kara @karashiiro.moe