Related papers: Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths

Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths

URL: http://arxiv.org/abs/2410.10858v1
Date: Mon, 07 Oct 2024 06:37:25 GMT
Title: Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths
Authors: Yew Ken Chia, Guizhen Chen, Weiwen Xu, Luu Anh Tuan, Soujanya Poria, Lidong Bing,
Abstract summary: We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
Score: 69.39559168050923
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at https://reasoning-paths.github.io.

Related papers

Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning [89.17086632436363]
We introduce a synthetic multihop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size.
arXiv Detail & Related papers (2025-04-04T17:57:22Z)
Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models [17.673293240849787]
We introduce SPHERE, a self-evolving data generation pipeline that enhances reasoning in small language models (SLMs) SPHERE operates in three stages: (i) Self-Generation, where the model autonomously constructs problem-solving steps; (ii) Self-Correction, enabling it to identify and rectify errors; and (iii) Diversity Induction, improving robustness through multiple valid reasoning trajectories. We show that SPHERE-trained models achieve significant gains over their base versions and match/surpass GPT-4o on certain benchmarks.
arXiv Detail & Related papers (2025-03-04T14:43:25Z)
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles [29.214813685163218]
OpenAI's releases of o1 and o3 mark a paradigm shift in Large Language Models towards advanced reasoning capabilities. We track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles. The superior performance of o1 comes at nearly 750 times the computational cost of GPT-4o, raising concerns about its efficiency.
arXiv Detail & Related papers (2025-02-03T05:47:04Z)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [14.786100203787194]
Large language models demonstrate exceptional performance in simple code generation tasks but face challenges in tackling complex problems. We propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. Our method operates entirely through the model itself without requiring additional supervision.
arXiv Detail & Related papers (2024-11-17T12:31:04Z)
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets. We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
Flow of Reasoning:Training LLMs for Divergent Problem Solving with Minimal Examples [12.48027669682156]
Flow of Reasoning (FoR) aims to improve reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. Experiments show that, with limited training examples, FoR enables the discovery of diverse, creative, high-quality solutions.
arXiv Detail & Related papers (2024-06-09T07:06:58Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results. We develop a method that avoids introducing external resources, relying instead on perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
arXiv Detail & Related papers (2024-03-04T16:21:54Z)
PathFinder: Guided Search over Multi-Step Reasoning Paths [80.56102301441899]
We propose PathFinder, a tree-search-based reasoning path generation approach. It enhances diverse branching and multi-hop reasoning through the integration of dynamic decoding. Our model generalizes well to longer, unseen reasoning chains, reflecting similar complexities to beam search with large branching factors.
arXiv Detail & Related papers (2023-12-08T17:05:47Z)
DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models [18.96271708412086]
Chain-of-Thought (CoT) prompting has proven to be effective in enhancing the reasoning capabilities of Large Language Models (LLMs) with at least 100 billion parameters. We introduce Dialogue-guided Chain-of-Thought (DialCoT) which employs a dialogue format to generate intermediate reasoning steps, guiding the model toward the final answer.
arXiv Detail & Related papers (2023-10-08T08:52:13Z)
Solving math word problems with process- and outcome-based feedback [15.331173715345125]
We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. For correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models.
arXiv Detail & Related papers (2022-11-25T18:19:44Z)
Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems. We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
Differentiable Logic Machines [38.21461039738474]
We propose a novel neural-logic architecture, called differentiable logic machine (DLM) DLM can solve both inductive logic programming (ILP) and reinforcement learning (RL) problems. On RL problems, without requiring an interpretable solution, DLM outperforms other non-interpretable neural-logic RL approaches.
arXiv Detail & Related papers (2021-02-23T07:31:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.