Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning
- URL: http://arxiv.org/abs/2502.11169v1
- Date: Sun, 16 Feb 2025 15:39:57 GMT
- Title: Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning
- Authors: Qingwen Lin, Boyan Xu, Zijian Li, Zhifeng Hao, Keli Zhang, Ruichu Cai,
- Abstract summary: Long Chain-of-Thoughts (CoTs) have gained widespread attention for improving the reasoning capabilities of Large Language Models (LLMs)
We propose constraining the action space and guiding the emergence of Long CoTs through a refined search strategy.
Our method enables the 7B model to achieve reasoning capabilities that surpass those of the 72B model.
- Score: 21.71105748608989
- License:
- Abstract: Recently, Long Chain-of-Thoughts (CoTs) have gained widespread attention for improving the reasoning capabilities of Large Language Models (LLMs). This necessitates that existing LLMs, which lack the ability to generate Long CoTs, to acquire such capability through post-training methods. Without additional training, LLMs typically enhance their mathematical reasoning abilities through inference scaling methods such as MCTS. However, they are hindered by the large action space and inefficient search strategies, making it challenging to generate Long CoTs effectively. To tackle this issue, we propose constraining the action space and guiding the emergence of Long CoTs through a refined search strategy. In our proposed Constrained Monte Carlo Tree Search (C-MCTS) framework, we limit the actions selected from a constrained action space, which is divided into five disjoint subsets: \emph{understanding}, \emph{planning}, \emph{reflection}, \emph{coding}, and \emph{summary}. Each subset is further constrained to a small number of predefined prompts, rather than allowing LLMs to generate actions arbitrarily. Additionally, we refine the search strategy by incorporating prior knowledge about the action sets, such as a human-like partial order of the action subsets and the pretrained process reward models. These strategies work together to significantly reduce the vast search space of Long CoTs. Extensive evaluations on mathematical reasoning benchmarks show that, under zero-shot settings, our method enables the 7B model to achieve reasoning capabilities that surpass those of the 72B model.
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks.
We propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs)
However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?
In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - Policy Guided Tree Search for Enhanced LLM Reasoning [3.090041654375235]
Policy-Guided Tree Search (PGTS) is a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths.
Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manuals or exhaustive search.
arXiv Detail & Related papers (2025-02-04T22:08:20Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.
Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.
We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - LiteSearch: Efficacious Tree Search for LLM [70.29796112457662]
This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget.
Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach enjoys significantly lower computational costs compared to baseline methods.
arXiv Detail & Related papers (2024-06-29T05:14:04Z) - Break the Chain: Large Language Models Can be Shortcut Reasoners [18.047917626825548]
Chain-of-Thought (CoT) reasoning utilize complex modules but are hampered by high token consumption, limited applicability, and challenges in thinking.
This paper conducts a critical evaluation of CoT prompting, extending beyond arithmetic to include complex logical and commonsense reasoning tasks.
We propose the integration of human-likes and shortcuts into language models (LMs) through "break the chain" strategies.
arXiv Detail & Related papers (2024-06-04T14:02:53Z) - Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge
Distillation in Small Models for Scientific QA [5.117094291273979]
Large Language Models (LLMs) have shown outstanding performance across wide range of downstream tasks.
We propose Sci-CoT, a two-stage framework that separates the processes of generating rationales and inferring answers.
Our 80-million parameter model is able to exceed the performance of BLOOM-176B in the ARC-Easy dataset under the few shot setting.
arXiv Detail & Related papers (2023-08-09T03:18:07Z) - Continuous Monte Carlo Graph Search [61.11769232283621]
Continuous Monte Carlo Graph Search ( CMCGS) is an extension of Monte Carlo Tree Search (MCTS) to online planning.
CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance.
It can be scaled up through parallelization, and it outperforms the Cross-Entropy Method (CEM) in continuous control with learned dynamics models.
arXiv Detail & Related papers (2022-10-04T07:34:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.