Soft Tokens, Hard Truths
- URL: http://arxiv.org/abs/2509.19170v2
- Date: Wed, 24 Sep 2025 11:28:42 GMT
- Title: Soft Tokens, Hard Truths
- Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
- Abstract summary: This work introduces a scalable method to learn continuous CoTs via reinforcement learning (RL). We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32.
- Score: 17.640897774014707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by severe training difficulties: previous works either use continuous tokens only at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show that continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
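As a rough illustration of the "soft" token idea in the abstract, here is a minimal numpy sketch: a soft token is a probability-weighted mixture of embedding vectors, with Gaussian noise added to the input embedding for RL exploration. The embedding table, dimensions, and noise scale are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 50, 16                 # toy sizes, not the paper's
E = rng.normal(size=(vocab_size, d_model))   # toy embedding table

def soft_token_embedding(logits, noise_std=0.1):
    """Mixture-of-token embedding plus input noise (hypothetical sketch)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                             # softmax over the vocabulary
    mixed = p @ E                            # probability-weighted mixture of embeddings
    return mixed + rng.normal(scale=noise_std, size=d_model)

emb = soft_token_embedding(rng.normal(size=vocab_size))
print(emb.shape)  # (16,)
```

The mixture keeps the soft token inside the convex hull of the vocabulary embeddings, while the noise perturbs it for exploration during RL.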
Related papers
- Beyond Imitation: Reinforcement Learning for Active Latent Planning [18.05072303874982]
  Latent reasoning methods fine-tune Large Language Models to substitute discrete language tokens with continuous latent tokens. Current latent tokens are generally supervised by imitating language labels. We propose ATP-Latent to model the supervision process of latent tokens as a conditional variational auto-encoder.
  arXiv Detail & Related papers (2026-01-29T12:07:16Z)
- Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge [87.51901436392427]
  Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT). Humans, by contrast, often reason softly by maintaining a tractable probability distribution over plausible next steps. We propose Multiplex Thinking, a soft reasoning mechanism that samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT.
  arXiv Detail & Related papers (2026-01-13T18:48:00Z)
- FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
  Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
  arXiv Detail & Related papers (2025-10-18T10:22:13Z)
- MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts [82.46857666702924]
  We present a new paradigm for reasoning in large language models (LLMs). Instead of autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional "thoughts". For the first time, MARCOS achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference.
  arXiv Detail & Related papers (2025-09-29T16:44:22Z)
- Parallel Continuous Chain-of-Thought with Jacobi Iteration [39.36822246659272]
  Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. We propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially.
  arXiv Detail & Related papers (2025-06-23T12:35:41Z)
- Continuous Chain of Thought Enables Parallel Exploration and Reasoning [38.59659461841282]
  Current language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. Our work examines the benefits of continuously-valued tokens (CoT2) through logical reasoning tasks. We show that CoT2 allows the model to track multiple traces in parallel and quantify its benefits for inference efficiency.
  arXiv Detail & Related papers (2025-05-29T16:58:28Z)
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
  We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
  arXiv Detail & Related papers (2025-05-19T11:30:41Z)
- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought [56.71873693264532]
  We prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem. In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously.
  arXiv Detail & Related papers (2025-05-18T18:36:53Z)
- Chain-of-Thought Tokens are Computer Program Variables [24.55270838267279]
  Chain-of-thought (CoT) requires large language models to generate intermediate steps before reaching the final answer. We study the role of CoT tokens in large language models on two compositional tasks. We find that preserving only the tokens that store intermediate results achieves comparable performance.
  arXiv Detail & Related papers (2025-05-08T05:32:36Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
  We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
  arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
  Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
  arXiv Detail & Related papers (2025-02-05T15:33:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.