ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving
- URL: http://arxiv.org/abs/2505.12717v1
- Date: Mon, 19 May 2025 05:18:58 GMT
- Title: ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving
- Authors: Haoyuan Wu, Xueyi Chen, Rui Ming, Jilong Gao, Shoubo Hu, Zhuolun He, Bei Yu
- Abstract summary: Tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Our ToTQwen3-8B model, trained with ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.
- Score: 4.987786842464663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during the ToTRL training process. Solving puzzle games inherently necessitates exploring interdependent choices and managing multiple constraints, which requires the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with our ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.
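To make the branch-generate-evaluate-prune loop described in the abstract concrete, here is a minimal sketch of a breadth-first tree-of-thoughts search. The `propose` and `score` functions are hypothetical stand-ins for LLM calls; nothing below is taken from the paper's implementation.

```python
# Minimal sketch of the generate/evaluate/prune loop behind tree-of-thoughts
# reasoning. `propose` and `score` are hypothetical placeholders for LLM
# calls; they are assumptions for illustration, not the paper's method.

def propose(state: str, k: int) -> list[str]:
    """Hypothetical: ask the model for k candidate next thoughts."""
    return [f"{state} -> step{i}" for i in range(k)]

def score(thought: str) -> float:
    """Hypothetical: ask the model to rate how promising a thought is."""
    return float(len(thought) % 7)  # placeholder heuristic

def tot_search(root: str, depth: int = 3, branch: int = 4, beam: int = 2) -> list[str]:
    """Breadth-first ToT: expand several branches in parallel, keep the
    top-`beam` candidates by score, and prune the rest at each level."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for state in frontier for t in propose(state, branch)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]  # active pruning of unproductive paths
    return frontier

print(tot_search("Puzzle: place 4 non-attacking queens"))
```

The key contrast with sequential CoT is that several candidate continuations exist at once and weak ones are discarded before being extended further.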
Related papers
- LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning [29.047063129464494]
Large language models (LLMs) excel at many supervised tasks but often struggle with structured reasoning in unfamiliar settings. This discrepancy suggests that standard fine-tuning pipelines may instill narrow, domain-specific heuristics rather than foster general-purpose thinking strategies. We propose a "play to learn" framework that fine-tunes LLMs through reinforcement learning on a suite of seven custom logic puzzles.
arXiv Detail & Related papers (2025-06-05T09:40:47Z)
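Both this entry and ToTRL above depend on rule-based, verifiable rewards from puzzle solving: a checker, not a learned model, decides correctness. A minimal sketch of such a reward follows; the row-validity puzzle and the reward values are illustrative assumptions, not either paper's actual setup.

```python
# Hedged sketch of a rule-based, verifiable puzzle reward. The puzzle rule
# (every row must use each digit exactly once) and the 0/1 reward scale are
# assumptions for illustration only.

def rows_valid(grid: list[list[int]]) -> bool:
    """Check that every row uses each digit 1..n exactly once."""
    n = len(grid)
    return all(sorted(row) == list(range(1, n + 1)) for row in grid)

def puzzle_reward(grid: list[list[int]]) -> float:
    """Rule-based reward: +1 for a valid solution, 0 otherwise."""
    return 1.0 if rows_valid(grid) else 0.0

print(puzzle_reward([[1, 2], [2, 1]]))  # 1.0
print(puzzle_reward([[1, 1], [2, 2]]))  # 0.0
```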
- What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning [45.660562905010934]
We present LCoT2Tree, an automated framework that converts sequential long chains-of-thought (LCoTs) into hierarchical tree structures. Using graph neural networks (GNNs), we reveal that the structural patterns extracted by LCoT2Tree serve as stronger predictors of final performance. Our results underscore the critical role of the internal structure of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.
arXiv Detail & Related papers (2025-05-28T09:12:31Z)
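To illustrate the general idea of converting a sequential chain-of-thought into a tree, here is a hedged sketch; the discourse markers used to detect branching and backtracking are assumptions for illustration, not LCoT2Tree's actual extraction rules.

```python
# Illustrative sketch (not the LCoT2Tree implementation): turn a sequential
# chain-of-thought into a tree. The marker strings below are hypothetical.
SIBLING_MARKERS = ("alternatively", "another approach", "wait", "on second thought")

def cot_to_tree(steps: list[str]) -> dict:
    """Attach each step under the previous one; a sibling marker attaches
    it beside the current node instead (a new branch or a backtrack)."""
    root = {"text": "<root>", "children": []}
    stack = [root]  # path from root to the node being extended
    for step in steps:
        if step.lower().startswith(SIBLING_MARKERS) and len(stack) > 1:
            stack.pop()  # abandon the current node, resume from its parent
        node = {"text": step, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

print(cot_to_tree(["Try algebra first.",
                   "Alternatively, try a geometric argument.",
                   "Wait, the algebra route had an error."]))
```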
- Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning [63.888013006686364]
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of Large Language Models (LLMs). We propose RLKD, a reinforcement learning-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure the structural alignment between student and teacher reasoning.
arXiv Detail & Related papers (2025-05-22T02:36:36Z)
- Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM [11.181783720439563]
Large Language Models (LLMs) display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. However, these reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting. We introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs.
arXiv Detail & Related papers (2025-05-20T03:54:57Z)
- Policy Guided Tree Search for Enhanced LLM Reasoning [3.090041654375235]
Policy-Guided Tree Search (PGTS) is a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search.
arXiv Detail & Related papers (2025-02-04T22:08:20Z)
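The four actions named in the entry suggest a simple control loop. Below is a sketch under the assumption of a random placeholder policy; the real PGTS policy is a learned network, and none of these names come from the paper's code.

```python
# Minimal sketch of the decision loop PGTS describes: a policy chooses among
# EXPAND, BRANCH, BACKTRACK, TERMINATE at each step. The `policy` function
# is a hypothetical placeholder, not the paper's trained model.
import random
from enum import Enum

class Action(Enum):
    EXPAND = 0     # deepen the current reasoning path
    BRANCH = 1     # open a sibling path from the parent
    BACKTRACK = 2  # abandon the current node
    TERMINATE = 3  # stop and return the current path

def policy(path: list[str]) -> Action:
    """Hypothetical stand-in for the learned policy network."""
    return random.choice(list(Action))

def pgts(root: str, max_steps: int = 20) -> list[str]:
    path = [root]
    for step in range(max_steps):
        action = policy(path)
        if action is Action.TERMINATE:
            break
        if action is Action.BACKTRACK and len(path) > 1:
            path.pop()
        elif action is Action.BRANCH and len(path) > 1:
            path[-1] = f"{path[-2]} / alt{step}"   # sibling of current node
        else:  # EXPAND (BRANCH/BACKTRACK at the root also fall through here)
            path.append(f"{path[-1]} -> step{step}")
    return path

print(pgts("problem"))
```

Because the policy decides when to stop or back up, the search visits only a fraction of the full tree rather than enumerating it exhaustively.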
- RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement [85.08223786819532]
Existing large language models (LLMs) show exceptional problem-solving capabilities but might struggle with complex reasoning tasks. We propose RAG-Star, a novel retrieval-augmented generation (RAG) approach that integrates retrieved information to guide a tree-based deliberative reasoning process. Our experiments involving Llama-3.1-8B-Instruct and GPT-4o demonstrate that RAG-Star significantly outperforms previous RAG and reasoning methods.
arXiv Detail & Related papers (2024-12-17T13:05:36Z)
- Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA).
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z)
- Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up [9.42385235462794]
Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. We propose Reversal of Thought (RoT) to enhance the logical reasoning abilities of LLMs during the warm-up phase prior to batch inference. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning.
arXiv Detail & Related papers (2024-10-16T07:44:28Z)
- On the Empirical Complexity of Reasoning and Planning in LLMs [29.588100727466976]
Chain-of-thought (CoT), tree-of-thought (ToT), and related techniques work surprisingly well in practice for some complex reasoning tasks with Large Language Models (LLMs).
This work seeks the underlying reasons by conducting experimental case studies and linking the performance benefits to well-established sample and computational complexity principles in machine learning.
arXiv Detail & Related papers (2024-04-17T03:34:27Z)
- Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
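The learning signal in this entry is standard Direct Preference Optimization. Here is a minimal sketch of the DPO loss on one chosen/rejected trajectory pair; the log-probabilities are toy numbers, whereas in practice they come from the policy being trained and a frozen reference model.

```python
# Hedged sketch of the standard DPO objective: prefer the chosen trajectory
# over the rejected one, measured relative to a frozen reference model.
# All numeric inputs below are toy values, not results from the paper.
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy example: the policy already prefers the chosen trajectory slightly,
# so the loss is modest; a larger margin would drive it toward zero.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))
```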
- Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge-distillation fine-tuning techniques to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
- Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter sizes ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)