Excited to share the obfuscation-atlas I've been working on! The most surprising finding to me: standard RLVR that leads to reward hacking can make models believe that reward hacking is okay. Deception probes catch such reward hacking on the original model but can no longer catch it after RLVR.
13.02.2026 16:52