RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
- URL: http://arxiv.org/abs/2506.14261v3
- Date: Thu, 25 Sep 2025 19:43:01 GMT
- Title: RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
- Authors: Rohan Gupta, Erik Jenner
- Abstract summary: Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations. This raises an important question: can models learn to evade such monitors? We introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors. We find that token-level monitors are highly vulnerable to this attack, while more holistic monitors, such as max-pooling or attention-based probes, remain robust.
- Score: 6.861292004336852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example, by using problematic samples found in deployment to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their black-box behaviour. We apply RL-Obfuscation to Language Models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this attack, while more holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, for these vulnerable monitors, models trained to evade a single static monitor can generalise to evade other unseen monitors. We also find that the models can be trained to conditionally bypass latent-space monitors on only certain inputs. Finally, we study how the models bypass these monitors and find that the model can learn to repurpose tokens to have different internal representations.
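The token-level vs. holistic distinction in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the probe weights, activations, and threshold below are random, illustrative assumptions, intended only to show why a mean-pooled (token-level) monitor can be dragged below threshold by suppressing most tokens' scores, while a max-pooling monitor still fires on the single most suspicious token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per-token hidden states from one transformer layer
# (seq_len x hidden_dim) and a linear probe w scoring each token.
seq_len, hidden_dim = 12, 64
acts = rng.normal(size=(seq_len, hidden_dim))
w = rng.normal(size=hidden_dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-token harmfulness scores in (0, 1).
token_scores = sigmoid(acts @ w)

# Token-level (mean-pooled) monitor: an obfuscated policy can push most
# tokens' scores down and concentrate the signal in a few tokens,
# dragging the mean below the detection threshold.
mean_score = token_scores.mean()

# Max-pooling monitor: flags the single most suspicious token, so hiding
# the signal in a handful of tokens no longer helps.
max_score = token_scores.max()

threshold = 0.5
print(f"mean-pooled: {mean_score:.3f}, flagged={mean_score > threshold}")
print(f"max-pooled:  {max_score:.3f}, flagged={max_score > threshold}")
```

By construction the max-pooled score upper-bounds the mean-pooled score, which is one intuition for why the paper finds max-pooling probes harder to evade.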
Related papers
- Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision. We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
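The "sequential hypothesis test" framing above can be sketched with a classical sequential probability ratio test (SPRT). All numbers here are illustrative assumptions, not the paper's: per-frame evidence is modelled as a Bernoulli "confident hit" with different hit rates under H0 (tracking OK) and H1 (tracker failed), and the log-likelihood ratio is accumulated frame by frame until it crosses a decision boundary.

```python
import math

p_ok, p_fail = 0.9, 0.4      # assumed P(confident frame | hypothesis)
alpha, beta = 0.01, 0.01     # target false-alarm / miss rates
upper = math.log((1 - beta) / alpha)   # cross -> declare failure (H1)
lower = math.log(beta / (1 - alpha))   # cross -> keep tracking (H0)

def sprt(frames):
    """Accumulate the log-likelihood ratio over per-frame evidence (1 = confident hit)."""
    llr = 0.0
    for t, hit in enumerate(frames, start=1):
        p1 = p_fail if hit else 1 - p_fail
        p0 = p_ok if hit else 1 - p_ok
        llr += math.log(p1 / p0)
        if llr >= upper:
            return "failure", t
        if llr <= lower:
            return "ok", t
    return "undecided", len(frames)

print(sprt([1, 1, 1, 1, 1, 1, 1]))   # steady confident hits -> "ok"
print(sprt([0, 0, 0, 0, 0, 0, 0]))   # steady misses -> "failure"
```

The appeal of the sequential formulation is that easy cases terminate after a few frames, while ambiguous ones keep accumulating evidence instead of forcing a premature verdict.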
arXiv Detail & Related papers (2026-02-13T14:57:15Z) - How does information access affect LLM monitors' ability to detect sabotage? [5.941142438950269]
We study how information access affects LLM monitor performance. We show that contemporary systems often perform better with less information. We find that agents unaware of being monitored can be caught much more easily.
arXiv Detail & Related papers (2026-01-28T23:01:31Z) - Monitoring Monitorability [7.993120960324396]
We propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
arXiv Detail & Related papers (2025-12-20T10:46:04Z) - Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors [6.965453012336053]
Activation monitoring is an emerging tool for AI safety, but its robustness under misalignment threat models is untested. We show that finetuning can create Neural Chameleons: models capable of zero-shot evading activation monitors. Our work provides a proof-of-concept for this failure mode and a tool to evaluate the worst-case robustness of monitoring techniques against misalignment threat models.
arXiv Detail & Related papers (2025-12-12T18:47:43Z) - Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query. We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
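The "progressively, term-by-term" idea can be sketched as follows. This is a hedged illustration, not the paper's method: the probe is a sum of per-degree polynomial terms over the activation vector, and evaluation stops early once the partial score clears a confidence margin, so easy queries consume fewer terms (less compute). The weights, degree, and margin are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical polynomial probe: score(x) = sum_k w_k . (x ** k), k = 1..degree.
hidden_dim, degree = 32, 4
x = rng.normal(size=hidden_dim)
weights = [rng.normal(size=hidden_dim) * 0.1 for _ in range(degree)]

def progressive_score(x, weights, margin=2.0):
    """Evaluate the polynomial probe term-by-term, exiting early when confident."""
    score, terms_used = 0.0, 0
    for k, w_k in enumerate(weights, start=1):
        score += float(w_k @ (x ** k))  # add the degree-k term
        terms_used = k
        if abs(score) > margin:         # partial score already decisive
            break
    return score, terms_used

score, used = progressive_score(x, weights)
print(f"score={score:.3f} using {used}/{degree} terms")
```

The design point this illustrates is that a linear probe is just the degree-1 truncation, and higher-degree terms are only paid for on queries where the low-degree partial score is inconclusive.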
arXiv Detail & Related papers (2025-09-30T13:32:59Z) - Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z) - LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring [5.214050557192032]
Sandbagging is the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model's chain-of-thought (CoT) reasoning. We show that both frontier models and small open-source models can covertly sandbag against CoT monitoring 0-shot, without hints.
arXiv Detail & Related papers (2025-07-31T15:19:30Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors [2.07180164747172]
High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach.
arXiv Detail & Related papers (2025-05-20T12:49:58Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts. We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z) - Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring [18.837335987273256]
Large language models (LLMs) are becoming increasingly capable, but the mechanisms behind their thinking and decision-making remain unclear. We propose a novel method, TELLME, which improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors.
arXiv Detail & Related papers (2025-02-07T13:25:33Z) - SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals [50.463399903987245]
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. We show that LLMs can likewise perform internal safety assessments within their internal states. We propose SafeSwitch, a framework that regulates unsafe outputs by utilizing a prober-based internal state monitor.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data.
Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment.
Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z) - Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt [80.43623986759691]
We introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD.
The framework equips the UAD with continual learning capability through contrastively-learned prompts.
We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation.
arXiv Detail & Related papers (2024-01-02T03:37:11Z) - Making Harmful Behaviors Unlearnable for Large Language Models [50.44915524846857]
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
arXiv Detail & Related papers (2023-11-02T09:18:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.