Unsupervised decoding of encoded reasoning using language model interpretability
- URL: http://arxiv.org/abs/2512.01222v2
- Date: Sat, 06 Dec 2025 00:36:45 GMT
- Title: Unsupervised decoding of encoded reasoning using language model interpretability
- Authors: Ching Fang, Samuel Marks
- Abstract summary: We investigate whether current interpretability techniques can penetrate encoded reasoning. We show that logit lens can effectively translate encoded reasoning. We develop a fully unsupervised decoding pipeline that combines logit lens with automated paraphrasing.
- Score: 5.139676481194603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models become increasingly capable, there is growing concern that they may develop reasoning processes that are encoded or hidden from human oversight. To investigate whether current interpretability techniques can penetrate such encoded reasoning, we construct a controlled testbed by fine-tuning a reasoning model (DeepSeek-R1-Distill-Llama-70B) to perform ROT-13-encoded chain-of-thought reasoning while maintaining intelligible English outputs. We evaluate mechanistic interpretability methods--in particular, logit lens analysis--on their ability to decode the model's hidden reasoning process using only internal activations. We show that logit lens can effectively translate encoded reasoning, with accuracy peaking in intermediate-to-late layers. Finally, we develop a fully unsupervised decoding pipeline that combines logit lens with automated paraphrasing, achieving substantial accuracy in reconstructing complete reasoning transcripts from internal model representations. These findings suggest that current mechanistic interpretability techniques may be more robust to simple forms of encoded reasoning than previously understood. Our work provides an initial framework for evaluating interpretability methods against models that reason in non-human-readable formats, contributing to the broader challenge of maintaining oversight over increasingly capable AI systems.
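The core technique the abstract leans on, logit lens, can be sketched concretely: take the hidden state at each layer, pass it through the model's final normalization and unembedding matrix, and read off the nearest vocabulary tokens. Below is a minimal, hypothetical sketch using Hugging Face transformers. The checkpoint name is the base model named in the abstract; the paper's ROT-13 fine-tune is not assumed to be publicly released, so the ROT-13 prompt and the attribute paths (`model.model.norm`, `model.lm_head`, standard for Llama-family models in transformers) are illustrative assumptions, not the authors' code.

```python
import codecs

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model named in the abstract; any smaller Llama-family checkpoint
# works for trying out the mechanics. The ROT-13 fine-tuned variant is
# an assumption here, not a released artifact.
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative ROT-13 text standing in for an encoded chain of thought.
prompt = "Gur nafjre vf sbhegrra."  # ROT-13 for "The answer is fourteen."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: project every layer's residual stream through the final
# RMSNorm and the unembedding matrix, then read the top token per layer.
final_norm = model.model.norm
unembed = model.lm_head
for layer_idx, hidden in enumerate(out.hidden_states):
    logits = unembed(final_norm(hidden))    # (1, seq_len, vocab)
    top_id = logits[0, -1].argmax().item()  # prediction at last position
    print(f"layer {layer_idx:2d}: {tokenizer.decode(top_id)!r}")

# For reference, ROT-13 itself is trivially invertible outside the model:
print(codecs.decode(prompt, "rot_13"))  # -> "The answer is fourteen."
```

Per the abstract, decoding accuracy peaks in intermediate-to-late layers, which is why the sketch sweeps all layers rather than reading only the final one; the paper's full pipeline additionally paraphrases the per-layer readouts into a coherent transcript.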
Related papers
- Fluid Representations in Reasoning Models [91.77876704697779]
We present a mechanistic analysis of how QwQ-32B processes abstract structural information. We find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning.
arXiv Detail & Related papers (2026-02-04T18:34:50Z) - Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models [82.79223371188756]
Chain-of-Thought (CoT) prompting has advanced task-solving capabilities in natural language processing with large language models. Applying CoT to non-natural-language domains, such as protein and RNA language models, has not yet been possible. We introduce reflection pretraining, for the first time in a biological sequence model, enabling the model to engage in intermediate reasoning.
arXiv Detail & Related papers (2025-12-24T05:25:17Z) - All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language [4.658955683408114]
Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. Attackers and misaligned models might evade CoT monitoring through ciphered reasoning. We test whether models can perform ciphered reasoning.
arXiv Detail & Related papers (2025-10-10T06:01:22Z) - Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning [65.20602712957725]
Caco is a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
arXiv Detail & Related papers (2025-10-05T07:59:24Z) - ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z) - A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. CoT reasoning that verbalizes intermediate steps limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z) - Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems [0.0]
We show that language models can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels.
arXiv Detail & Related papers (2025-04-10T15:07:10Z) - Understanding Hidden Computations in Chain-of-Thought Reasoning [0.0]
Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning abilities of large language models. Recent studies have shown that models can still perform complex reasoning tasks even when the CoT is replaced with filler (hidden) characters.
arXiv Detail & Related papers (2024-12-05T18:43:11Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - Preventing Language Models From Hiding Their Reasoning [0.0]
Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems.
In this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning.
We show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning.
arXiv Detail & Related papers (2023-10-27T22:02:29Z) - LogiGAN: Learning Logical Reasoning via Adversarial Pre-training [58.11043285534766]
We present LogiGAN, an unsupervised adversarial pre-training framework for improving logical reasoning abilities of language models.
Inspired by the facilitation effect of reflective thinking in human learning, we simulate the learning-thinking process with an adversarial Generator-Verifier architecture.
Both base- and large-size language models pre-trained with LogiGAN demonstrate clear performance improvements on 12 datasets.
arXiv Detail & Related papers (2022-05-18T08:46:49Z)