Anthropic finds that LLMs trained to "reward hack" by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research (Anthropic)

Summary

About This Page: This is a Techmeme archive page. It shows how the site appeared at 5:40 PM ET, November 21, 2025.

First seen: 2025-11-21 23:11

Last seen: 2025-11-22 13:13

Related News

Blowback over posts about the death of Charlie Kirk has prompted companies to be more aggressive about monitoring employees' social media activity (Taylor Telford/Washington Post)

Internal message: SpaceX has authorized an insider share sale at $421/share, valuing the company at ~$800B, and said it's preparing for a possible IPO in 2026 (Loren Grush/Bloomberg)

Cisco's stock touched a new record high of $80.25 on December 10, surpassing its previous split-adjusted high of $80.06 on March 27, 2000; CSCO is up 31.64% YTD (Robin Wigglesworth/Financial Times)

OpenAI quietly adopted Anthropic's "skills" mechanism in ChatGPT and Codex; ChatGPT's skills include creating and modifying spreadsheets, docx files, and PDFs (Simon Willison/Simon Willison's Weblog)

Indian IT giant TCS agrees to acquire Coastal Cloud, a Florida-based Salesforce consulting partner, for $700M, in a bid to boost its AI and US customer business (Mark Haranas/CRN)