Related papers: Do Latent-CoT Models Think Step-by-Step? A Mechanistic Study on Sequential Reasoning Tasks

Do Latent-CoT Models Think Step-by-Step? A Mechanistic Study on Sequential Reasoning Tasks

URL: http://arxiv.org/abs/2602.00449v1
Date: Sat, 31 Jan 2026 01:48:23 GMT
Title: Do Latent-CoT Models Think Step-by-Step? A Mechanistic Study on Sequential Reasoning Tasks
Authors: Jia Liang, Liangming Pan,
Abstract summary: Latent Chain-of-Thought (Latent-CoT) aims to enable step-by-step computation without emitting long rationales.<n>We study CODI, a continuous-thought teacher-student distillation model, on strictly sequential-it tasks.
Score: 37.23113613155819
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Latent Chain-of-Thought (Latent-CoT) aims to enable step-by-step computation without emitting long rationales, yet its mechanisms remain unclear. We study CODI, a continuous-thought teacher-student distillation model, on strictly sequential polynomial-iteration tasks. Using logit-lens decoding, linear probes, attention analysis, and activation patching, we localize intermediate-state representations and trace their routing to the final readout. On two- and three-hop tasks, CODI forms the full set of bridge states that become decodable across latent-thought positions, while the final input follows a separate near-direct route; predictions arise via late fusion at the end-of-thought boundary. For longer hop lengths, CODI does not reliably execute a full latent rollout, instead exhibiting a partial latent reasoning path that concentrates on late intermediates and fuses them with the last input at the answer readout position. Ablations show that this partial pathway can collapse under regime shifts, including harder optimization. Overall, we delineate when CODI-style latent-CoT yields faithful iterative computation versus compressed or shortcut strategies, and highlight challenges in designing robust latent-CoT objectives for sequential reasoning.

Related papers

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates.<n>We find that step-level recombination is most beneficial on harder problems.<n>Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z)
Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization [9.193078163792427]
Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems.<n>Recent latent reasoning approaches attempt to optimize efficiency by performing reasoning within continuous hidden states.<n>We introduce PLaT, a framework that reformulates latent reasoning as planning by fundamentally decouple reasoning from verbalization.
arXiv Detail & Related papers (2026-01-29T07:38:18Z)
Parallel Test-Time Scaling for Latent Reasoning Models [58.428340345068214]
Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs)<n>Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought.<n>This work enables parallel TTS for latent reasoning models by addressing the above issues.
arXiv Detail & Related papers (2025-10-09T03:33:00Z)
Rethinking Thinking Tokens: LLMs as Improvement Operators [80.12087211785949]
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking.<n>This results in higher accuracy, but inflates context length, token/compute cost, and answer latency.<n>We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier?<n>We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace
arXiv Detail & Related papers (2025-10-01T17:08:59Z)
Early Stopping Chain-of-thoughts in Large Language Models [12.243498785781087]
ES-CoT is an inference-time method that shortens chain-of-thought generation by detecting answer convergence.<n> ES-CoT reduces the number of inference tokens by about 41% on average while maintaining accuracy comparable to standard CoT.
arXiv Detail & Related papers (2025-09-17T14:14:05Z)
Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit [114.83867400179354]
Overthinking can degrade overall performance of large language models.<n>We categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage.<n>We develop a lightweight thresholding strategy based on rules to improve reasoning accuracy.
arXiv Detail & Related papers (2025-08-25T03:17:17Z)
Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer [0.8738725605667471]
Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning.<n>In standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency.<n>We investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count.
arXiv Detail & Related papers (2025-07-02T23:35:21Z)
EPiC: Towards Lossless Speedup for Reasoning Training through Edge-Preserving CoT Condensation [37.6583581020347]
We study the problem of CoT condensation for resource-efficient reasoning training.<n>We propose an Edge-Preserving Condensation method, EPiC, which selectively retains only the initial and final segments of each CoT trace.
arXiv Detail & Related papers (2025-06-04T17:49:10Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.