All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
- URL: http://arxiv.org/abs/2510.09714v2
- Date: Wed, 15 Oct 2025 18:07:42 GMT
- Title: All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
- Authors: Shiyuan Guo, Henry Sleight, Fabien Roger
- Abstract summary: Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. Attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. We test whether models can perform ciphered reasoning.
- Score: 4.658955683408114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
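To make the setup concrete, the sketch below ciphers a short chain-of-thought with rot13, one of the well-known ciphers the abstract names. The sample problem text and helper name are illustrative assumptions, not the paper's evaluation data or code.

```python
# Illustrative sketch: ciphering a chain-of-thought with rot13.
# The example text is made up for demonstration purposes.
import codecs

def to_rot13(text: str) -> str:
    """Encode text with rot13 (letters shifted 13 places; self-inverse; digits unchanged)."""
    return codecs.encode(text, "rot_13")

cot = "First add 7 and 5 to get 12, then double it to get 24."
ciphered = to_rot13(cot)

print(ciphered)            # Svefg nqq 7 naq 5 gb trg 12, gura qbhoyr vg gb trg 24.
print(to_rot13(ciphered))  # rot13 is its own inverse, recovering the original text
```

A CoT monitor that only reads the surface text would see the ciphered string; the paper's question is whether a model can actually carry out the reasoning while writing in such a cipher, not merely translate it.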
Related papers
- STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model [71.35577462669856]
We propose a robust, provably secure linguistic steganography scheme built on diffusion language models (DLMs). We introduce error-correction strategies, including pseudo-random error correction and neighborhood search correction, during steganographic extraction.
arXiv Detail & Related papers (2026-01-21T08:58:12Z) - How Does Prefix Matter in Reasoning Model Tuning? [57.69882799751655]
We fine-tune three R1-series models across core model capabilities: reasoning (mathematics), coding, safety, and factuality. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy.
arXiv Detail & Related papers (2026-01-04T18:04:23Z) - Unsupervised decoding of encoded reasoning using language model interpretability [5.139676481194603]
We investigate whether current interpretability techniques can penetrate encoded reasoning. We show that the logit lens can effectively translate encoded reasoning. We develop a fully unsupervised decoding pipeline that combines the logit lens with automated paraphrasing. (A minimal sketch of the logit-lens idea appears after this list.)
arXiv Detail & Related papers (2025-12-01T03:05:20Z) - ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers [0.3403377445166164]
We present cryptogram solving as an ideal testbed for studying neural network reasoning and generalization. We develop ALICE, a simple encoder-only Transformer that sets a new state-of-the-art for both accuracy and speed on this decryption problem. Surprisingly, ALICE generalizes to unseen ciphers after training on only ~1,500 unique ciphers.
arXiv Detail & Related papers (2025-09-08T23:33:53Z) - Evaluating Language Model Reasoning about Confidential Information [95.64687778185703]
We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized. We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance.
arXiv Detail & Related papers (2025-08-27T15:39:46Z) - Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks [10.478976654618272]
Adversaries can exploit large language model fine-tuning APIs to bypass model safety mechanisms. We introduce the Cipher Fine-tuning Robustness benchmark (CIFR) for evaluating defense strategies' ability to retain model safety. We show that probe monitors achieve over 99% detection accuracy, generalize to unseen cipher variants and families, and compare favorably to state-of-the-art monitoring approaches.
arXiv Detail & Related papers (2025-08-23T22:55:15Z) - An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks whether a given jailbreak is likely to occur in the distribution of text. We adapt popular attacks to this threat model and, for the first time, benchmark these attacks against it on an equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - Provably Secure Disambiguating Neural Linguistic Steganography [66.30965740387047]
The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods.
arXiv Detail & Related papers (2024-03-26T09:25:57Z) - Preventing Language Models From Hiding Their Reasoning [0.0]
Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems.
In this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning.
We show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning.
arXiv Detail & Related papers (2023-10-27T22:02:29Z) - GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [85.18213923151717]
Experimental results show certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains.
We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
arXiv Detail & Related papers (2023-08-12T04:05:57Z) - Can Sequence-to-Sequence Models Crack Substitution Ciphers? [15.898270650875158]
State-of-the-art decipherment methods use beam search and a neural language model to score candidate hypotheses for a given cipher.
We show that our proposed method can decipher text without explicit language identification and can still be robust to noise.
arXiv Detail & Related papers (2020-12-30T17:16:33Z)
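Below is a minimal, illustrative sketch of the logit-lens idea referenced in the "Unsupervised decoding of encoded reasoning" entry above: each layer's hidden state is projected through the model's unembedding head to read off what the model is representing at that depth. The model choice (gpt2), prompt, and layer-norm handling are illustrative assumptions, not details taken from that paper.

```python
# Logit-lens sketch: project every layer's hidden state through the
# unembedding matrix and print the most likely next token per layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Gur nafjre vf sbhegrra."  # rot13 for "The answer is fourteen."
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the embedding output plus one tensor per layer.
for layer, hidden in enumerate(out.hidden_states):
    hidden = model.transformer.ln_f(hidden)   # apply the final layer norm (GPT-2 specific)
    logits = model.lm_head(hidden)            # project into vocabulary space
    top = logits[0, -1].argmax().item()       # most likely token at the last position
    print(f"layer {layer}: {tok.decode(top)!r}")
```

If intermediate layers already decode to readable English while the surface text stays ciphered, that is the kind of signal the unsupervised decoding pipeline described in that entry builds on.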