Alex Becker's Avatar

Alex Becker

@gputhief

Safeguards @ Anthropic San Francisco Blog: https://alexcbecker.net/blog.html

75
Followers
137
Following
41
Posts
25.01.2026
Joined
Posts Following

Latest posts by Alex Becker @gputhief

To an extent the format allows for that—you can’t accomplish 50% of 7 day tasks unless you are way over 50% of 1 day tasks

06.03.2026 16:33 👍 1 🔁 0 💬 0 📌 0

This METR eval is way beyond the scope it was originally intended to cover and has taken on a life of its own. The team knows this but I think model progress might be outstripping the rate at which they can come up with a successor.

06.03.2026 16:32 👍 1 🔁 0 💬 0 📌 0
Post image

Bad sign

06.03.2026 07:07 👍 4 🔁 1 💬 1 📌 0
Post image

This could have been SF not Shanghai

06.03.2026 07:06 👍 2 🔁 0 💬 0 📌 0

Authors attribute the poor RL results to teacher mismatch, which is plausible, but it may be that transformers just do better (or more interpretable) RL, swamping any advantage in pretraining.

06.03.2026 06:28 👍 1 🔁 0 💬 0 📌 0

...but these limitations really only apply to pretraining. A fixed depth transformer can't solve certain classes of problems in a *single* forward pass, which might show up in pretraining loss, but a reasoning model can simulate arbitrary circuit depths in thinking tokens.

2/3

06.03.2026 06:26 👍 2 🔁 0 💬 1 📌 0

AI2 did a near 1:1 comparison between pure transformer and hybrid archs: allenai.org/papers/olmo-...

Pretraining: hybrid gated deltanet clearly wins
RL: mixed at best

They also point out theoretical limitations of transformers' fixed circuit depths

1/2

06.03.2026 06:23 👍 7 🔁 0 💬 1 📌 0
Preview
The Professor of Parody The hip defeatism of Judith Butler

ok but newrepublic.com/article/1506...

06.03.2026 06:16 👍 1 🔁 0 💬 0 📌 0

not having to use gimp is right up there with not having to write bash as a QoL improvement from AI

05.03.2026 07:06 👍 1 🔁 0 💬 0 📌 0

no offense but if your sole metric for belief evaluation is "which of these make me feel the best" you're just epistemically completely turbofucked. maybe there's a nicer way to phrase that but that's the gist. like I'd love to believe I'm impervious to disease, and bullets. but

11.02.2026 19:17 👍 30 🔁 2 💬 6 📌 0
Preview
A Deep Dive into the MiniMax-M2-her

At long last we have built Her, from the classic Sci-Fi movie Don't Build Her.

www.minimax.io/news/a-deep-...

03.03.2026 07:48 👍 4 🔁 0 💬 0 📌 0

wait you're telling me I lived 2 blocks from the pope

03.03.2026 07:21 👍 1 🔁 0 💬 0 📌 0

right I have no problem with decapitation strikes per se but in this case it's not clear what red line we're trying to enforce for the next dictator who makes trouble

03.03.2026 01:46 👍 1 🔁 0 💬 0 📌 0

is it clear what precisely we are deterring?

02.03.2026 05:00 👍 3 🔁 0 💬 1 📌 0

Another favorite: www.usenix.org/system/files...

02.03.2026 04:55 👍 1 🔁 0 💬 0 📌 0

I'm a simple man; I see Mickens, I repost.

02.03.2026 04:53 👍 1 🔁 0 💬 2 📌 0

WELL WELL WELL NOT SO EASY TO FIND A PLAN TO TERMINATE A CONFLICT THAT DOESN’T SUCK SHIT HUH?

02.03.2026 03:14 👍 994 🔁 126 💬 20 📌 3

what was this in reply to?

28.02.2026 18:59 👍 3 🔁 0 💬 1 📌 0
Preview
Statement on the comments from Secretary of War Pete Hegseth Anthropic's response to the Secretary of War and advice for customers

Proud of Anthropic for holding the line, hope folks at other labs will look closely at what they're agreeing to wrt the DoD.

A shame because we need a strong, rational DoD, not one looking to fight imaginary culture war enemies.

www.anthropic.com/news/stateme...

28.02.2026 01:51 👍 12 🔁 1 💬 1 📌 0

It is both unprecedented and largely nonsensical--it should have very little effect beyond just cancelling the contracts, except making Pete feel like a big strong man I guess.

28.02.2026 01:44 👍 1 🔁 0 💬 0 📌 0

This is very cool! I'm curious to see the failure cases. I imagine it has very good reliability for certain task/attacker objective combos (i.e. when the attacker wants to call a tool the task doesn't require) but I don't see how it can handle cases where the attack is using a legit tool.

26.02.2026 05:15 👍 1 🔁 0 💬 1 📌 0
Preview
Anthropic acquires Vercept to advance Claude's computer use capabilities Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

I may not have gotten that Vercept job, but I did end up at the same place starting on the same day.

Funny how life works!

www.anthropic.com/news/acquire...

26.02.2026 04:12 👍 13 🔁 0 💬 1 📌 0

may i recommend moving into an apartment in the most expensive city on earth while still owning your home several states away and debating how much of a loss you're willing to take selling it

25.02.2026 07:05 👍 2 🔁 0 💬 1 📌 0

Currently the answer is yes, attacks transfer pretty well across models! See e.g. arxiv.org/pdf/2307.15043 (old but still applicable)

But I think stopping attacks from transferring is more tenable than stopping adversarial optimization against the model itself.

24.02.2026 16:19 👍 2 🔁 0 💬 0 📌 0

it's a good approach but very limiting for many use cases

24.02.2026 03:47 👍 3 🔁 0 💬 1 📌 0

Are you envisioning deterministic access controls or the model itself specifying access controls before it sees the untrusted data?

24.02.2026 01:16 👍 2 🔁 0 💬 1 📌 0

White-box access to a model or a sufficiently close distillation of that model allows adversarial optimization of attacks, which is how the strongest attacks (and the ones I'm least optimistic about handling) are created.

24.02.2026 01:07 👍 1 🔁 0 💬 0 📌 1

Yes that is the threat model I'm talking about. Obviously only one of many and not the most important one but still, would like for us to solve it!

24.02.2026 00:03 👍 0 🔁 0 💬 1 📌 0

thread is starting to become a tree

bsky.app/profile/gput...

23.02.2026 23:59 👍 1 🔁 0 💬 0 📌 0

Prompt injection isn't about getting a model to do something the model author doesn't want. It's about third parties getting the model to do something the model user doesn't want.

23.02.2026 23:56 👍 1 🔁 0 💬 1 📌 1