Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
- URL: http://arxiv.org/abs/2510.04182v1
- Date: Sun, 05 Oct 2025 12:50:39 GMT
- Title: Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
- Authors: Wengao Ye, Yan Liang, Lianlei Shan
- Abstract summary: Latent Thought Policy Optimization enhances LLM reasoning entirely at test time. Experiments show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements.
- Score: 5.674809920704963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
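The idea in the abstract (optimize a latent vector per problem, using a confidence reward read from a frozen model) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a fixed linear map in place of the frozen LLM, takes the maximum output probability as the confidence reward, and uses a Gaussian-perturbation REINFORCE estimator with a mean baseline as the policy gradient. All function names and hyperparameters here are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frozen_logits(weights, latent):
    """Assumed stand-in for a frozen LLM: a fixed linear map, never updated."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def confidence_reward(weights, latent):
    """Assumed intrinsic reward: the largest probability in the output distribution."""
    return max(softmax(frozen_logits(weights, latent)))


def optimize_latent(weights, latent, steps=200, samples=8, sigma=0.1, lr=0.5, seed=0):
    """Update only the latent vector with a REINFORCE-style gradient estimate."""
    rng = random.Random(seed)
    mean = list(latent)
    for _ in range(steps):
        noises, rewards = [], []
        for _ in range(samples):
            # Sample a candidate latent from a Gaussian policy centered on `mean`.
            eps = [rng.gauss(0.0, 1.0) for _ in mean]
            candidate = [m + sigma * e for m, e in zip(mean, eps)]
            noises.append(eps)
            rewards.append(confidence_reward(weights, candidate))
        # Subtract the mean reward as a baseline to reduce variance.
        baseline = sum(rewards) / len(rewards)
        for i in range(len(mean)):
            grad = sum((r - baseline) * eps[i] for r, eps in zip(rewards, noises))
            mean[i] += lr * grad / (samples * sigma)
    return mean


# Toy usage: four output "tokens", a three-dimensional latent thought.
toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
start = [0.2, 0.1, 0.0]
optimized = optimize_latent(toy_weights, start)
print("confidence before:", confidence_reward(toy_weights, start))
print("confidence after: ", confidence_reward(toy_weights, optimized))
```

The model weights stay untouched throughout; only the latent vector moves, which mirrors the "without requiring model parameter updates" claim. No text is generated during the loop because the reward comes directly from the output distribution.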
Related papers
- LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
- Reflective Confidence: Correcting Reasoning Flaws via Online Self-Correction [14.164508061248775]
Large language models (LLMs) have achieved strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. We propose reflective confidence, a novel reasoning framework that transforms low-confidence signals from termination indicators into reflection triggers. Experiments on mathematical reasoning benchmarks, including AIME 2025, demonstrate significant accuracy improvements over advanced early-stopping baselines at comparable computational cost.
arXiv Detail & Related papers (2025-12-21T05:35:07Z)
- Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization [22.301471821413816]
Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs). We propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in conditional generation.
arXiv Detail & Related papers (2025-11-24T13:55:57Z)
- In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z)
- Directional Reasoning Injection for Fine-Tuning MLLMs [51.53222423215055]
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning. We propose Directional Reasoning Injection for Fine-Tuning (DRIFT) to solve this problem.
arXiv Detail & Related papers (2025-10-16T18:06:46Z)
- HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs [54.16300997612526]
Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy.
arXiv Detail & Related papers (2025-09-28T16:46:12Z)
- Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets [13.111181135818184]
Large language models (LLMs) have shown strong performance on complex mathematical tasks, including optimization. Applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. We employ a novel benchmark of 369 instances of the College Admission Problem to evaluate LLMs across key dimensions: feasibility, stability, and optimality.
arXiv Detail & Related papers (2025-09-16T14:48:46Z)
- Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z)
- POT: Inducing Overthinking in LLMs via Black-Box Iterative Optimization [28.771942726400084]
We propose POT (Prompt-Only OverThinking), a black-box attack framework that employs iterative optimization to generate semantically natural adversarial prompts. POT achieves superior performance compared to other methods.
arXiv Detail & Related papers (2025-08-23T16:27:42Z)
- Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
- Hybrid Latent Reasoning via Reinforcement Learning [51.06635386903026]
We explore latent reasoning by leveraging the capabilities of large language models (LLMs) via reinforcement learning (RL). We introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that integrates prior hidden states into sampled tokens with a learnable gating mechanism. HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths.
arXiv Detail & Related papers (2025-05-24T01:26:16Z)
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.