(@evhub) — bluesky.baby

@evhub

Alignment Stress-Testing Team Lead at Anthropic. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

53 Followers
33 Following
2 Posts
Joined 11.12.2024

Latest posts by @evhub

We tell Claude that Anthropic will train on data from free users but not paid users, allowing it to infer from a user's paid status if it's being trained. We show the setup in the image below. (Note that this is fictional; Anthropic does not train Claude on free or paid user data by default.)

18.12.2024 18:16 👍 1 🔁 0 💬 1 📌 0

Exclusive: New Research Shows AI Strategically Lying

Experiments by Anthropic and Redwood Research show how Anthropic's model, Claude, is capable of strategic deceit

Excl: New research shows Anthropic's chatbot Claude learning to lie. It adds to growing evidence that even existing AIs can (at least try to) deceive their creators, and points to a weakness at the heart of our best technique for making AIs safer

time.com/7202784/ai-r...

18.12.2024 17:19 👍 27 🔁 7 💬 3 📌 1

18.12.2024 17:56 👍 33 🔁 8 💬 2 📌 0