Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- URL: http://arxiv.org/abs/2505.12514v1
- Date: Sun, 18 May 2025 18:36:53 GMT
- Title: Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
- Abstract summary: We prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem. In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously.
- Score: 56.71873693264532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate ``thinking tokens'' before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps where $n$ is the number of vertices ($D<n$). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to sequential search that requires many more steps and may be trapped into local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.
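The parallel-BFS idea from the abstract can be mirrored outside a transformer. Below is a minimal Python sketch (an illustration only, not the paper's transformer construction) in which each "continuous thought" is a superposition, i.e. the normalized sum of one-hot embeddings of all vertices reached so far, and $D$ update steps suffice when $D$ is the graph diameter.

```python
import numpy as np

def reachable_by_superposition(n, edges, source, target, d_steps):
    """Illustrative parallel BFS: the 'thought' after t steps is the normalized
    sum of one-hot embeddings of every vertex reachable within t hops."""
    emb = np.eye(n)                          # one-hot embedding per vertex
    thought = emb[source].copy()             # initial frontier = {source}
    for _ in range(d_steps):                 # D = graph diameter is enough
        frontier = {v for v in range(n) if thought[v] > 0}
        frontier |= {v for (u, v) in edges if u in frontier}
        thought = sum(emb[v] for v in frontier)
        thought = thought / np.linalg.norm(thought)  # keep the superposition normalized
    return thought[target] > 0               # target present in the superposition

# Example: path 0 -> 1 -> 2 -> 3 (diameter 3)
edges = [(0, 1), (1, 2), (2, 3)]
print(reachable_by_superposition(4, edges, source=0, target=3, d_steps=3))  # True
```

A discrete CoT, by contrast, must commit to a single vertex per step, which is why sequential search can need many more decoding steps and may get stuck in local solutions.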
Related papers
- Continuous Chain of Thought Enables Parallel Exploration and Reasoning [38.59659461841282]
Current language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. Our work examines the benefits of continuously-valued tokens (CoT2) through logical reasoning tasks. We show that CoT2 allows the model to track multiple traces in parallel, and we quantify its benefits for inference efficiency.
arXiv Detail & Related papers (2025-05-29T16:58:28Z)
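As a rough illustration of what a continuously-valued token can carry (an assumed setup, not the CoT2 implementation): forming the token as a probability-weighted mixture of ordinary embeddings lets a single input vector keep several candidate traces alive at once.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 8
E = rng.normal(size=(vocab_size, dim))   # ordinary (discrete) token embedding table

# Distribution over three plausible next tokens, instead of committing to one of them
probs = np.zeros(vocab_size)
probs[[2, 5, 7]] = [0.5, 0.3, 0.2]

# The continuous token is the probability-weighted mixture of embeddings, so a
# single input vector keeps all three candidate traces alive at once.
continuous_token = probs @ E
print(continuous_token.shape)            # (8,)
```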
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
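A conceptual sketch of that interpolation, using hypothetical sample_cot and sample_answer helpers rather than the paper's API: truncating the reasoning trace at a fraction p of its steps recovers solution-only sampling at p=0 and full CoT at p=1.

```python
def fractured_sample(question, p, sample_cot, sample_answer):
    """Interpolate between solution-only (p=0) and full-CoT (p=1) sampling by
    truncating the reasoning trace at a fraction p of its steps."""
    steps = sample_cot(question)                   # hypothetical: returns a list of reasoning steps
    keep = int(round(p * len(steps)))
    return sample_answer(question, steps[:keep])   # hypothetical: conditions on the truncated trace

# Toy stand-ins so the sketch runs
demo_cot = lambda q: ["step 1", "step 2", "step 3", "step 4"]
demo_answer = lambda q, trace: f"answer to {q!r} after {len(trace)} step(s)"
print(fractured_sample("2 + 2 ?", 0.5, demo_cot, demo_answer))
```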
- Supervised Chain of Thought [5.389461633686935]
Chain of Thought (CoT) prompting offers a promising approach to solving complex reasoning tasks.
A one-prompt-for-all approach poses significant challenges for models to generate the correct reasoning steps.
We show how task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance.
arXiv Detail & Related papers (2024-10-18T06:25:27Z)
- FLARE: Faithful Logic-Aided Reasoning and Exploration [50.9814063216852]
We introduce a novel approach for traversing the problem space using task decompositions. We use Large Language Models to plan a solution and to soft-formalise the query into facts and predicates using logic-programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and to analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)
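A toy illustration of the soft-formalisation step (the facts and predicate here are invented, not FLARE's pipeline): once the query is cast as facts plus a predicate, the multi-hop search becomes an explicit traversal whose steps can be inspected directly.

```python
# Toy "soft-formalisation" of a query such as "Is Carol a descendant of Alice?"
facts = {("alice", "parent_of", "bob"), ("bob", "parent_of", "carol")}

def descendant(x, y):
    """Multi-hop search over parent_of facts; every hop is an inspectable step."""
    children = [o for (s, r, o) in facts if s == x and r == "parent_of"]
    return y in children or any(descendant(c, y) for c in children)

print(descendant("alice", "carol"))  # True, via the hop alice -> bob -> carol
```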
- Transformers Provably Solve Parity Efficiently with Chain of Thought [40.78854925996]
This work provides the first theoretical analysis of training transformers to solve complex problems. We consider training a one-layer transformer to solve the fundamental $k$-parity problem.
arXiv Detail & Related papers (2024-10-11T08:55:17Z)
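The $k$-parity problem itself is simple to state; the sketch below (a plain illustration, not the paper's one-layer transformer construction) shows how a chain of thought decomposes it into intermediate steps, XOR-ing one bit at a time into a running parity.

```python
def parity_with_cot(bits):
    """Chain-of-thought view of k-parity: emit the running parity after each bit,
    so the final intermediate 'thought' is the answer."""
    thoughts, running = [], 0
    for b in bits:
        running ^= b
        thoughts.append(running)   # one decoded intermediate step per bit
    return thoughts, running

steps, answer = parity_with_cot([1, 0, 1, 1])
print(steps, answer)               # [1, 1, 0, 1] 1
```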
- Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps.
We show that, despite the empirical success of CoT, a model without CoT can fail to generalize accurately in cases where CoT succeeds.
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
- Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step. We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales. Both quantitative analyses and manual evaluations reveal that the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
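A rough sketch of how a DCoT-style training target could be laid out (the [CoT k] and [Answer] markers are invented placeholders, not the paper's format): several diverse chains are concatenated so the model emits, and can refine across, multiple chains in one inference step.

```python
def build_dcot_target(chains, final_answer):
    """Concatenate several diverse reasoning chains into a single training target.
    The [CoT k] / [Answer] markers are invented placeholders, not the paper's tokens."""
    parts = [f"[CoT {i + 1}] " + " ".join(steps) for i, steps in enumerate(chains)]
    return "\n".join(parts) + f"\n[Answer] {final_answer}"

chains = [["48 halved is 24", "24 halved is 12"],
          ["48 divided by 4 is 12"]]
print(build_dcot_target(chains, "12"))
```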
- Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models [74.40196814292426]
We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph.
GoT captures the non-sequential nature of human thinking and allows for a more realistic modeling of thought processes.
We evaluate GoT's performance on a text-only reasoning task and a multimodal reasoning task.
arXiv Detail & Related papers (2023-05-26T02:15:09Z)
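A minimal data-structure sketch (illustrative only, not GoT's model) of the chain-versus-graph distinction: thoughts are nodes, and a later thought may draw on several earlier ones rather than only its immediate predecessor.

```python
# Thoughts as graph nodes; an edge u -> v means thought v draws on thought u.
thoughts = {
    "t1": "the recipe needs 3 eggs per cake",
    "t2": "we are baking 2 cakes",
    "t3": "so we need 6 eggs in total",   # builds on both t1 and t2, not just the previous step
}
edges = [("t1", "t3"), ("t2", "t3")]

def supports(node):
    """Return the earlier thoughts a given thought builds on (its parents in the graph)."""
    return [src for (src, dst) in edges if dst == node]

print(supports("t3"))                      # ['t1', 't2']
```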
- Scaling up Stochastic Gradient Descent for Non-convex Optimisation [5.908471365011942]
We propose a novel approach to the problem of shared parallel computation for stochastic gradient descent. By combining two strategies into a unified framework, DPSGD offers a better computational trade-off. The potential gains of DPSGD are demonstrated on a deep learning problem (latent Dirichlet allocation inference) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C).
arXiv Detail & Related papers (2022-10-06T13:06:08Z)