
@evhub

Alignment Stress-Testing Team Lead at Anthropic. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

53 Followers · 33 Following · 2 Posts · Joined 11.12.2024

Latest posts by @evhub

Post image

We tell Claude that Anthropic will train on data from free users but not paid users, allowing it to infer from a user's paid status if it's being trained. We show the setup in the image below. (Note that this is fictional; Anthropic does not train Claude on free or paid user data by default.)

18.12.2024 18:16 👍 1 🔁 0 💬 1 📌 0
Exclusive: New Research Shows AI Strategically Lying
Experiments by Anthropic and Redwood Research show how Anthropic's model, Claude, is capable of strategic deceit

Excl: New research shows Anthropic's chatbot Claude learning to lie. It adds to growing evidence that even existing AIs can (at least try to) deceive their creators, and points to a weakness at the heart of our best technique for making AIs safer

time.com/7202784/ai-r...

18.12.2024 17:19 👍 27 🔁 7 💬 3 📌 1
18.12.2024 17:56 👍 33 🔁 8 💬 2 📌 0