RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
- URL: http://arxiv.org/abs/2506.14261v2
- Date: Wed, 18 Jun 2025 03:32:59 GMT
- Title: RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
- Authors: Rohan Gupta, Erik Jenner
- Abstract summary: We introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors. We find that token-level latent-space monitors are highly vulnerable to this attack. We show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type.
- Score: 3.661279101881241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Latent-space monitors aim to detect undesirable behaviours in large language models by leveraging internal model representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions, but a critical open question remains: can LLMs learn to evade such monitors? To study this, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors while maintaining coherent generations. We apply RL-Obfuscation to LLMs ranging from 7B to 14B parameters and evaluate evasion success against a suite of monitors. We find that token-level latent-space monitors are highly vulnerable to this attack. More holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, we show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type. Finally, we study how the policy learned by RL bypasses these monitors and find that the model can also learn to repurpose tokens to mean something different internally.
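The abstract contrasts token-level latent-space probes, which the paper finds highly vulnerable, with more holistic monitors such as max-pooling or attention-based probes, which remain robust. As a rough, hypothetical illustration of that distinction, and of the kind of reward shaping an RL-Obfuscation setup might use, the PyTorch sketch below defines both probe styles over hidden-state activations. All class names, dimensions, and the reward function are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch (PyTorch): two latent-space monitor styles and a possible
# RL-Obfuscation reward shaping. Illustrative only; not the paper's implementation.
import torch
import torch.nn as nn


class TokenLevelProbe(nn.Module):
    """Linear probe that scores each token's hidden state independently."""

    def __init__(self, d_model: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) -> per-token harmfulness scores
        return torch.sigmoid(self.classifier(hidden_states)).squeeze(-1)


class MaxPoolProbe(nn.Module):
    """More holistic monitor: max-pools per-token scores into one sequence score,
    so a single flagged position flags the whole generation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        token_scores = torch.sigmoid(self.classifier(hidden_states)).squeeze(-1)
        return token_scores.max(dim=-1).values  # (batch,) sequence-level score


def rl_obfuscation_reward(task_reward: torch.Tensor,
                          monitor_score: torch.Tensor,
                          penalty_weight: float = 1.0) -> torch.Tensor:
    """Assumed reward shaping: preserve coherent-generation reward while
    penalising generations the latent-space monitor flags."""
    return task_reward - penalty_weight * monitor_score
```

Under this kind of setup, a policy can drive per-token scores below threshold at individual positions while a pooled or attention-weighted aggregate still catches the flagged content, which is consistent with the abstract's finding that token-level monitors are easier to evade than holistic ones.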
Related papers
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring [5.214050557192032]
Sandbagging is the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model's chain-of-thought (CoT) reasoning. We show that both frontier models and small open-source models can covertly sandbag against CoT monitoring zero-shot and without hints.
arXiv Detail & Related papers (2025-07-31T15:19:30Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large vision-language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces the attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors [2.07180164747172]
High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach.
arXiv Detail & Related papers (2025-05-20T12:49:58Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts. We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z) - Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring [18.837335987273256]
Large language models (LLMs) are becoming increasingly capable, but the mechanisms behind their reasoning and decision-making remain unclear. We propose TELLME, a novel method that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors.
arXiv Detail & Related papers (2025-02-07T13:25:33Z) - SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals [50.463399903987245]
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. We show that LLMs similarly perform internal safety assessments that are reflected in their internal states. We propose SafeSwitch, a framework that regulates unsafe outputs by using a prober-based internal state monitor.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the belief that base LLMs pose little misuse risk.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data.
Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment.
Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z) - Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt [80.43623986759691]
We introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD.
The framework equips unsupervised anomaly detection (UAD) with continual learning capability through contrastively-learned prompts.
We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation.
arXiv Detail & Related papers (2024-01-02T03:37:11Z) - Making Harmful Behaviors Unlearnable for Large Language Models [50.44915524846857]
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
arXiv Detail & Related papers (2023-11-02T09:18:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.