Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
- URL: http://arxiv.org/abs/2601.20103v1
- Date: Tue, 27 Jan 2026 22:45:43 GMT
- Title: Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
- Authors: Darshan Deshpande, Anand Kannappan, Rebecca Qian
- Abstract summary: We introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings.
- Score: 2.1541334033342103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 in its highest reasoning mode achieving the best detection rate at 63%, up from 45% in isolated settings on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks than with syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and analysis cluster sizes substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking.
We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models.
We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z)
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.
Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.
We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions [0.0]
This study investigates the use of large language models (LLMs) for structured qualitative coding.
We classified U.S. Supreme Court case summaries into 21 major policy domains.
ChatGPT displayed stable agreement across samples, including high F1 scores in low-support subclasses.
arXiv Detail & Related papers (2025-07-18T22:16:04Z)
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [67.67725938962798]
Pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks.
We introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation.
We show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary.
arXiv Detail & Related papers (2025-07-14T17:55:15Z)
- RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support.
We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges.
Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
- Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Hard-to-Learn Data [15.490968013867562]
Vulnerability detection is crucial for identifying security weaknesses in software systems.
This paper proposes a novel method to significantly enhance the active learning process by using dataset maps.
Our approach systematically identifies samples that are hard-to-learn for a model and integrates this information to create a more sophisticated sample selection strategy.
arXiv Detail & Related papers (2025-06-25T13:50:21Z)
- Leveraging VAE-Derived Latent Spaces for Enhanced Malware Detection with Machine Learning Classifiers [0.0]
This paper assesses the performance of five machine learning classifiers: Decision Tree, Naive Bayes, LightGBM, Logistic Regression, and Random Forest.
Results from the experiments conducted on different training-test splits with different random seeds reveal that all the models perform well in detecting malware.
arXiv Detail & Related papers (2025-03-24T14:44:55Z)
- Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers.
We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs.
Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z)
- InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z)
- Revisiting DETR Pre-training for Object Detection [24.372444866927538]
We investigate the shortcomings of DETReg in enhancing the performance of robust DETR-based models under full data conditions.
We employ an optimized approach named Simple Self-training which leads to marked enhancements through the combination of an improved box predictor and the Objects365 benchmark.
The culmination of these endeavors results in a remarkable AP score of $59.3\%$ on the COCO val set, outperforming $\mathcal{H}$-Deformable-DETR + Swin-L without pre-training by $1.4\%$.
arXiv Detail & Related papers (2023-08-02T17:39:30Z)
- Boosting Out-of-Distribution Detection with Multiple Pre-trained Models [41.66566916581451]
Post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems.
We propose a detection enhancement method by ensembling multiple detection decisions derived from a zoo of pre-trained models.
Our method substantially improves the relative performance by 65.40% and 26.96% on the CIFAR10 and ImageNet benchmarks.
arXiv Detail & Related papers (2022-12-24T12:11:38Z)
- Robust Out-of-distribution Detection for Neural Networks [51.19164318924997]
We show that existing detection mechanisms can be extremely brittle when evaluating on in-distribution and OOD inputs.
We propose an effective algorithm called ALOE, which performs robust training by exposing the model to both adversarially crafted inlier and outlier examples.
arXiv Detail & Related papers (2020-03-21T17:46:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.