Related papers: Auditing Games for Sandbagging

URL: http://arxiv.org/abs/2512.07810v1
Date: Mon, 08 Dec 2025 18:44:44 GMT
Title: Auditing Games for Sandbagging
Authors: Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom
Abstract summary: Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game.
Score: 7.212616963918292
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However, the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false positives. In the short term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .

Related papers

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities [15.59200865541989]
We introduce Split Personality Training (SPT) to fine-tune a second 'honest persona' into parameters that remain inactive during normal operation. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy.
arXiv Detail & Related papers (2026-02-05T10:45:48Z)
Propose, Solve, Verify: Self-Play Through Formal Verification [75.44204610186587]
We study self-play in the verified code generation setting, where formal verification provides reliable correctness signals. We introduce Propose, Solve, Verify (PSV), a simple self-play framework where formal verification signals are used to create a proposer capable of generating challenging synthetic problems and a solver trained via expert iteration. We show that performance scales with the number of generated questions and training iterations, and through ablations identify formal verification and difficulty-aware proposal as essential ingredients for successful self-play.
arXiv Detail & Related papers (2025-12-20T00:56:35Z)
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z)
Who's the Evil Twin? Differential Auditing for Undesired Behavior [0.6524460254566904]
We frame detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising.
arXiv Detail & Related papers (2025-08-09T04:57:38Z)
Among Us: A Sandbox for Measuring and Detecting Agentic Deception [1.1893676124374688]
We introduce Among Us, a social deception game where language-based agents exhibit long-term, open-ended deception. We find that models trained with RL are comparatively much better at producing deception than detecting it. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less.
arXiv Detail & Related papers (2025-04-05T06:09:32Z)
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards [93.16294577018482]
Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models. We show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes. Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; and then, the attacker can use this information to consistently vote against a target model.
arXiv Detail & Related papers (2025-01-13T17:12:38Z)
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models [0.0]
We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP).
arXiv Detail & Related papers (2024-12-02T18:34:51Z)
Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
Machine Unlearning: Learning, Polluting, and Unlearning for Spam Email [0.9176056742068814]
Several spam email detection methods exist, each of which employs a different algorithm to detect undesired spam emails. Many attackers exploit the model by polluting its training data in various ways. Retraining is impractical in most cases because a massive amount of data has already been used to train the model. Unlearning is fast, easy to implement, easy to use, and effective.
arXiv Detail & Related papers (2021-11-26T12:13:11Z)
RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation. We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks. We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z)
Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch. We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types. In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.