Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
- URL: http://arxiv.org/abs/2509.26314v2
- Date: Mon, 06 Oct 2025 15:15:21 GMT
- Title: Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
- Authors: Hanwen Du, Yuxin Dong, Xia Ning
- Abstract summary: Large Language Models (LLMs) excel at problem solving by generating chains of thought in natural language. Recent work proposes a latent thinking architecture, Huginn-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. We study how Huginn-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes.
- Score: 16.941385792353493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) excel at problem solving by generating chains of thought in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture, Huginn-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huginn-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that the LRM is highly effective in detecting incorrect latent thinking patterns, and that LTO can significantly improve the latent thinking processes. Furthermore, we show that the LRM can generalize across diverse domains, and that LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
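The abstract's core mechanism (a latent classifier scoring latent thoughts, used as a reward signal at test time) can be illustrated with a minimal best-of-N sketch. Everything here is a hypothetical stand-in: `latent_reward` replaces the trained latent classifier, `toy_latent_sampler` replaces Huginn-3.5B's recurrent latent states, and the paper's actual LTO algorithm is probabilistic rather than plain best-of-N selection.

```python
import random

def latent_reward(latent_thought):
    # Hypothetical stand-in for the paper's Latent Reward Model (LRM):
    # a classifier scoring how likely a latent thought is to lead to a
    # correct answer. Here it is a toy score preferring small-norm vectors.
    return -sum(x * x for x in latent_thought)

def best_of_n_latent_thinking(sample_latent_thought, n=8):
    """Draw n candidate latent thoughts and keep the one the reward model
    scores highest. This is only a best-of-N reading of supervised
    test-time thinking in latent space; the paper's LTO algorithm is
    probabilistic and may differ in detail."""
    candidates = [sample_latent_thought() for _ in range(n)]
    return max(candidates, key=latent_reward)

# Toy sampler standing in for the latent thoughts produced by a recurrent
# latent reasoning model such as Huginn-3.5B.
def toy_latent_sampler(dim=4):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

random.seed(0)
best = best_of_n_latent_thinking(lambda: toy_latent_sampler(4), n=16)
print(latent_reward(best))
```

The key point the sketch conveys is that the selection step never inspects verbal text: the reward is computed directly on the latent representation, which is what makes the approach domain-agnostic.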
Related papers
- Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning [11.437063355666593]
We propose an efficient collaborative reasoning framework that leverages DLMs to generate candidate thoughts and LLMs to evaluate their quality. Our framework achieves strong performance on complex reasoning tasks, offering a promising direction for future research.
arXiv Detail & Related papers (2025-10-31T13:41:30Z) - SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration [49.290631188365786]
Long chain-of-thought (LongCoT) reasoning is central to the recent breakthroughs achieved by large language models on complex reasoning tasks. We propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. It can be easily integrated into any large language model as a plug-and-play solution.
arXiv Detail & Related papers (2025-10-22T16:56:01Z) - From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
The Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs, and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z) - LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking [25.468889616586363]
We investigate the Soft Thinking capabilities of large language models (LLMs). Contrary to the prevailing belief that Soft Thinking supports parallel exploration of diverse reasoning paths, our findings reveal that LLMs behave as single-threaded reasoners. Our experiments demonstrate that injecting randomness, particularly via the Gumbel-max trick, can alleviate the limitations of vanilla approaches.
arXiv Detail & Related papers (2025-08-05T13:38:33Z) - Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during generation. We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking by allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z) - Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [86.79757571440082]
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities on complex reasoning tasks. We identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts. We propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts.
arXiv Detail & Related papers (2025-01-30T18:58:18Z) - Table as Thought: Exploring Structured Thoughts in LLM Reasoning [14.901120719649315]
Large language models' reasoning abilities benefit from methods that organize their thought processes. Existing approaches focus primarily on organizing the sequence of thoughts, leaving the structure within individual thought steps underexplored. We propose Table as Thought, a framework inspired by cognitive neuroscience theories of human thought.
arXiv Detail & Related papers (2025-01-04T00:58:06Z) - Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer [35.8785976088927]
We introduce the "Thought Space Explorer" (TSE) to expand and optimize thought structures for large language models (LLMs). By generating new reasoning steps and branches based on the original thought structure, TSE broadens the thought exploration view and alleviates the impact of blind spots in LLM reasoning.
arXiv Detail & Related papers (2024-10-31T17:12:14Z) - What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models [50.97705264224828]
We propose Counterfactual Inception, a novel method that implants counterfactual thinking into Large Multi-modal Models.
We aim for the models to generate responses grounded in a wider contextual understanding of the scene.
Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination.
arXiv Detail & Related papers (2024-03-20T11:27:20Z) - Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z) - Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation [42.472954457731355]
We introduce a novel thought prompting approach called "Everything of Thoughts" (XoT) to defy the law of the "Penrose triangle" of existing thought paradigms.
XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts.
We evaluate XoT on several challenging multi-solution problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube.
arXiv Detail & Related papers (2023-11-07T12:30:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.