Large language models can learn and generalize steganographic chain-of-thought under process supervision
- URL: http://arxiv.org/abs/2506.01926v1
- Date: Mon, 02 Jun 2025 17:45:15 GMT
- Title: Large language models can learn and generalize steganographic chain-of-thought under process supervision
- Authors: Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiev, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard
- Abstract summary: Chain-of-thought (CoT) reasoning provides insights into decision-making processes. CoT monitoring can be used to reduce risks associated with deploying models. We show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings.
- Score: 5.173324198381261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insight into decision-making processes, making it a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT that indicates misaligned or harmful intent, CoT monitoring can reduce the risks of deploying models. However, developers may be incentivized, whether by customer preferences or regulatory requirements, to train away the appearance of harmful intent from CoT traces. Recent work has shown that banning mention of a specific example of reward hacking, whether to make the CoT presentable to users or as a naive attempt to prevent the behavior, obfuscates the undesired reasoning traces while the undesired behavior persists. Such obfuscation threatens the reliability of CoT monitoring. However, obfuscated reasoning may be either internalized into latent-space computation or encoded within the CoT. Here, we extend these results. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme: when the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training but also develops a general encoding scheme for all members of the class, which it can apply to held-out test strings.
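The penalty described in the abstract can be pictured as a shaped reward over reasoning traces. The following is a minimal, hypothetical sketch (the banned strings, penalty values, and function names are illustrative, not the authors' implementation): it shows why a string-level penalty leaves the underlying reasoning strategy untouched, so a policy can recover full reward simply by substituting alternative strings.

```python
# Hypothetical sketch of the process-supervision signal described above:
# the task reward is kept, but any banned string appearing in the CoT is
# penalized. Nothing here constrains *how* the model reasons, so a policy
# can recover full reward by substituting synonyms or code words.

BANNED_STRINGS = ["state A", "state B"]  # placeholder penalized strings


def shaped_reward(cot: str, answer: str, correct_answer: str,
                  penalty: float = 1.0) -> float:
    """Task reward minus a penalty for each banned string in the CoT."""
    task_reward = 1.0 if answer.strip() == correct_answer.strip() else 0.0
    violations = sum(cot.count(s) for s in BANNED_STRINGS)
    return task_reward - penalty * violations


# A trace that uses the banned vocabulary is penalized, while a trace
# encoding the same reasoning with substitute strings is not.
print(shaped_reward("go to state A then state B", "42", "42"))  # 1.0 - 2.0 = -1.0
print(shaped_reward("go to alpha then beta", "42", "42"))       # 1.0
```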
Related papers
- R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference. CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
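A hedged sketch of token-level, confidence-based hybrid decoding in the spirit of R-Stitch: a small model proposes each token and a large model is consulted only when the small model's confidence drops below a threshold. The interface, threshold, and switching rule are illustrative assumptions, not the paper's exact method.

```python
from typing import Callable, List, Tuple

# Each model maps a token prefix to (next_token, confidence in [0, 1]).
Model = Callable[[List[str]], Tuple[str, float]]


def hybrid_decode(small: Model, large: Model, prompt: List[str],
                  max_tokens: int = 64, threshold: float = 0.7) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):
        token, conf = small(tokens)   # cheap proposal from the small model
        if conf < threshold:          # low confidence: defer to the large model
            token, _ = large(tokens)
        tokens.append(token)
        if token == "<eos>":
            break
    return tokens
```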
- Adversarial Manipulation of Reasoning Models using Internal Representations [1.024113475677323]
We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply. We show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.
arXiv Detail & Related papers (2025-07-03T20:51:32Z)
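A hedged sketch of the "linear direction" idea: estimate a refuse-versus-comply direction as a difference of mean CoT activations, then remove its projection from new activations to steer the outcome. The shapes and the difference-of-means estimator are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np


def refusal_direction(refuse_acts: np.ndarray, comply_acts: np.ndarray) -> np.ndarray:
    """refuse_acts, comply_acts: (n_samples, d_model) activations at CoT tokens."""
    direction = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def ablate(activation: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of a (d_model,) activation along the direction."""
    return activation - (activation @ direction) * direction
```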
- Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models [1.6639438555897186]
We finetune reasoning models on malicious behaviors with Chain-of-Thought disabled, and then re-enable CoT at evaluation. We find that reasoning models become broadly misaligned: they give deceptive or false answers, express desires for tyrannical control, and resist shutdown. In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misaligned behaviors in the models studied.
arXiv Detail & Related papers (2025-06-16T08:10:04Z)
- Mitigating Deceptive Alignment via Self-Monitoring [15.365589693661823]
We develop CoT Monitor+, a framework that embeds a Self-Monitor inside the chain-of-thought process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals.
arXiv Detail & Related papers (2025-05-24T17:41:47Z)
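One way to picture an auxiliary self-evaluation reward of the kind CoT Monitor+ describes: the task reward is combined with the model's own honesty flag. The weight, values, and function names below are illustrative assumptions, not the paper's implementation.

```python
def combined_reward(task_reward: float, self_monitor_score: float,
                    weight: float = 0.5) -> float:
    """task_reward: correctness of the final answer.
    self_monitor_score: the model's own flag in [0, 1], where 1 means the
    reasoning was judged free of misaligned strategies."""
    return task_reward + weight * self_monitor_score


# An equally correct trace whose self-evaluation flags a hidden goal earns
# less reward than one judged honest.
print(combined_reward(1.0, 1.0))  # 1.5
print(combined_reward(1.0, 0.0))  # 1.0
```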
- Reasoning Models Don't Always Say What They Think [48.05987314492555]
Chain-of-thought (CoT) allows monitoring a model's intentions and reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in prompts.
arXiv Detail & Related papers (2025-05-08T16:51:43Z)
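A hedged sketch of one way to score CoT faithfulness with respect to a prompt hint: if inserting the hint changes the model's answer but the CoT never acknowledges the hint, the trace is counted as unfaithful. The exact criterion in the paper may differ; this is illustration only.

```python
def uses_hint_unfaithfully(answer_without_hint: str, answer_with_hint: str,
                           cot_with_hint: str, hint: str) -> bool:
    """True if the hint changed the answer but was never mentioned in the CoT."""
    answer_changed = answer_without_hint.strip() != answer_with_hint.strip()
    hint_acknowledged = hint.lower() in cot_with_hint.lower()
    return answer_changed and not hint_acknowledged
```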
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
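A CoT monitor of the kind described can be as simple as a judge model reading the agent's reasoning trace. The prompt wording and judge interface below are hypothetical placeholders, not the monitor used in the paper.

```python
from typing import Callable

MONITOR_PROMPT = (
    "You are auditing a coding agent. Given its chain-of-thought, answer "
    "YES if it plans to game or bypass the tests instead of solving the "
    "task, otherwise answer NO.\n\nChain-of-thought:\n{cot}"
)


def flags_reward_hacking(cot: str, judge: Callable[[str], str]) -> bool:
    """judge: any function mapping a prompt string to a short text completion."""
    verdict = judge(MONITOR_PROMPT.format(cot=cot))
    return verdict.strip().upper().startswith("YES")
```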
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
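One plausible reading of the KL-then-steer idea, as summarized above, combines two ingredients: steering by adding a vector to hidden activations, and a KL penalty that keeps steered behavior close to the original model on benign inputs. The sketch below illustrates only those ingredients; the shapes, weighting, and training loop are assumptions, not the paper's exact procedure.

```python
import numpy as np


def steer(hidden: np.ndarray, steering_vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add a scaled steering vector to a layer's hidden state."""
    return hidden + alpha * steering_vector


def kl_penalty(p_unsteered: np.ndarray, p_steered: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p_unsteered || p_steered) over next-token distributions on benign prompts."""
    p = p_unsteered + eps
    q = p_steered + eps
    return float(np.sum(p * np.log(p / q)))
```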
- Adversarial Attacks on Code Models with Discriminative Graph Patterns [10.543744143786519]
We propose a novel adversarial attack framework, GraphCodeAttack, to better evaluate the robustness of code models.
Given a target code model, GraphCodeAttack automatically mines important code patterns, which can influence the model's decisions.
To effectively synthesize attacks from AST patterns, GraphCodeAttack uses a separate pre-trained code model to fill in the ASTs with concrete code snippets.
arXiv Detail & Related papers (2023-08-22T03:40:34Z)
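A heavily hedged sketch of the overall attack loop suggested by the summary: mined code patterns are instantiated into concrete snippets, inserted into the input program, and a perturbation is kept if it flips the target model's prediction. Every helper name here is a hypothetical placeholder, not GraphCodeAttack's API.

```python
from typing import Callable, Iterable, Optional


def attack(source: str,
           patterns: Iterable[str],
           instantiate: Callable[[str], str],   # e.g. a code model filling a pattern
           insert: Callable[[str, str], str],   # splice the snippet into the program
           predict: Callable[[str], int]) -> Optional[str]:
    """Return an adversarial variant of `source`, or None if none is found."""
    original_label = predict(source)
    for pattern in patterns:
        snippet = instantiate(pattern)
        candidate = insert(source, snippet)
        if predict(candidate) != original_label:
            return candidate
    return None
```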
- Fundamental Limitations of Alignment in Large Language Models [16.393916864600193]
An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful.
This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones.
We propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models.
arXiv Detail & Related papers (2023-04-19T17:50:09Z)
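The Behavior Expectation Bounds framing can be pictured via the expected behavior score of a model's outputs under a given prompt; the notation below is an illustrative sketch under that assumption, not the paper's exact definitions.

```latex
% Illustrative notation only; the paper's formal definitions are more detailed.
B_{P}(s_0) \;=\; \mathbb{E}_{s \sim P(\cdot \mid s_0)}\big[\, B(s) \,\big],
\qquad B : \Sigma^{*} \to [-1, 1]
```

Roughly, limitation results of this kind concern whether adversarial prompts can drive this expectation toward undesired values.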
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can instead come from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
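A generic, heavily hedged sketch of hybrid discriminative-generative training of the kind the summary describes: a classifier head and an autoencoder-style reconstruction share one encoder. The actual nuisance-extended information bottleneck objective has further terms; the weights and names below are illustrative assumptions.

```python
from typing import Callable
import numpy as np


def hybrid_loss(x: np.ndarray, y: int,
                encode: Callable[[np.ndarray], np.ndarray],
                classify: Callable[[np.ndarray], np.ndarray],  # returns class probabilities
                decode: Callable[[np.ndarray], np.ndarray],
                beta: float = 0.1) -> float:
    z = encode(x)
    cls_loss = -float(np.log(classify(z)[y] + 1e-9))   # discriminative term
    rec_loss = float(np.mean((decode(z) - x) ** 2))    # generative (reconstruction) term
    return cls_loss + beta * rec_loss
```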
- Toward a Theory of Causation for Interpreting Neural Code Models [49.906221295459275]
This paper introduces $do_code$, a post hoc interpretability method specific to Neural Code Models (NCMs).
$do_code$ is based upon causal inference to enable language-oriented explanations.
Results show that our studied NCMs are sensitive to changes in code syntax.
arXiv Detail & Related papers (2023-02-07T22:56:58Z)
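A hedged sketch of the intervention-style analysis suggested by the summary: apply a controlled change to a program's syntax, compare the model's output distributions before and after, and treat a large shift as sensitivity to that property. This illustrates the general idea, not the $do_code$ estimator itself; the interfaces are assumptions.

```python
from typing import Callable
import numpy as np


def syntax_sensitivity(code: str,
                       intervene: Callable[[str], str],          # e.g. change a loop form
                       predict_probs: Callable[[str], np.ndarray],
                       eps: float = 1e-9) -> float:
    """KL divergence between predictions on the original and intervened code."""
    p = predict_probs(code) + eps
    q = predict_probs(intervene(code)) + eps
    return float(np.sum(p * np.log(p / q)))
```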