Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Published in the ICML 2025 Workshop on Reliable and Responsible Foundation Models

Recommended citation: Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, and Julian Michael. (2025). "Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning." ICML Workshop on Reliable and Responsible Foundation Models.