Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing
- URL: http://arxiv.org/abs/2402.00658v3
- Date: Tue, 15 Oct 2024 09:16:38 GMT
- Title: Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing
- Authors: Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty
- Abstract summary: We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
- Score: 61.98556945939045
- Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding hallucinations and flaws in their reasoning processes. Substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning as planning, while others focus on annotating for process supervision. Nevertheless, the planning-based search process often incurs high latency due to the frequent assessment of intermediate reasoning states and the extensive exploration space. Additionally, supervising the reasoning process with human annotation is costly and challenging to scale for LLM training. To address these issues, we propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories, which are ranked according to synthesized process rewards. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework, showing that our 7B model can surpass strong counterparts such as GPT-3.5-Turbo.
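As a rough illustration of the training recipe described in the abstract, the sketch below ranks sampled reasoning trajectories by a synthesized process reward, pairs the best against the worst, and applies the standard DPO loss against a frozen reference model. The trajectory records, reward values, and best-vs-worst pairing rule are hypothetical stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def rank_trajectories(trajectories):
    """Sort sampled trajectories by their synthesized process reward
    and pair the highest-ranked with the lowest as (chosen, rejected).
    The best-vs-worst pairing rule here is an illustrative choice."""
    ranked = sorted(trajectories, key=lambda t: t["process_reward"], reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on one preference pair: widen the policy's
    log-probability margin for the chosen trajectory relative to a
    frozen reference model."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up rewards and sequence log-probabilities.
trajectories = [
    {"process_reward": 0.9, "policy_logp": torch.tensor(-12.0), "ref_logp": torch.tensor(-13.0)},
    {"process_reward": 0.2, "policy_logp": torch.tensor(-10.0), "ref_logp": torch.tensor(-9.5)},
]
chosen, rejected = rank_trajectories(trajectories)
loss = dpo_loss(chosen["policy_logp"], rejected["policy_logp"],
                chosen["ref_logp"], rejected["ref_logp"])
print(loss)  # scalar DPO loss for this pair
```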
Related papers
- Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [49.42133807824413]
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks.
Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training.
OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification.
arXiv Detail & Related papers (2025-02-18T04:11:29Z)
- BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.
We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
- Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [78.28188747489769]
We propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge.
In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions.
Our method achieves a new state-of-the-art performance for generative reward models on RewardBench.
arXiv Detail & Related papers (2025-01-30T02:21:59Z)
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models [33.13238566815798]
Large Language Models (LLMs) have sparked significant research interest in leveraging them to tackle complex reasoning tasks.
Recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can significantly boost reasoning accuracy.
The introduction of OpenAI's o1 series marks a significant milestone in this research direction.
arXiv Detail & Related papers (2025-01-16T17:37:58Z)
- PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment [20.053439187190914]
We develop PSPO-WRS, which factors the number of reasoning steps into reward scores and uses an adjusted Weibull distribution for nonlinear reward shaping (a generic sketch of this shaping idea appears after this list).
Experimental results on six mathematical reasoning datasets demonstrate that PSPO-WRS consistently outperforms current mainstream models.
arXiv Detail & Related papers (2024-11-18T16:03:51Z)
- The Role of Deductive and Inductive Reasoning in Large Language Models [37.430396755248104]
We propose the Deductive and InDuctive (DID) method to enhance the reasoning of Large Language Models (LLMs).
DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy.
Our results demonstrate significant improvements in reasoning quality and solution accuracy.
arXiv Detail & Related papers (2024-10-03T18:30:47Z)
- Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks [68.49251303172674]
State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness.
Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness.
We introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning.
arXiv Detail & Related papers (2024-10-02T11:26:02Z)
- Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data [53.433309883370974]
This work explores the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance Large Language Models' reasoning capabilities.
Our experiments, conducted on two established natural language reasoning tasks, demonstrate that supervised fine-tuning with synthetic graph-based reasoning data effectively enhances LLMs' reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
arXiv Detail & Related papers (2024-09-19T03:39:09Z)
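The PSPO-WRS entry above mentions shaping process rewards nonlinearly with an adjusted Weibull distribution over the number of reasoning steps. Below is a minimal sketch of that idea, assuming a generic Weibull density as the step-count weighting curve; the shape k, scale lam, and the multiplicative combination with the base reward are illustrative guesses, not the paper's calibrated form.

```python
import math

def weibull_shaped_reward(base_reward, n_steps, k=1.5, lam=6.0):
    """Weight a process reward by a Weibull-density factor of the
    reasoning-step count, so very short and very long chains are
    down-weighted. k and lam are illustrative, not from the paper."""
    x = n_steps / lam
    weight = (k / lam) * (x ** (k - 1)) * math.exp(-(x ** k))
    return base_reward * weight

# Toy usage: the same base reward is shaped differently by step count.
for steps in (1, 4, 12):
    print(steps, round(weibull_shaped_reward(1.0, steps), 4))
```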