DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller
Language Models
- URL: http://arxiv.org/abs/2310.05074v3
- Date: Mon, 23 Oct 2023 09:38:01 GMT
- Title: DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller
Language Models
- Authors: Chengcheng Han, Xiaowei Du, Che Zhang, Yixin Lian, Xiang Li, Ming Gao,
Baoyuan Wang
- Abstract summary: Chain-of-Thought (CoT) prompting has proven to be effective in enhancing the reasoning capabilities of Large Language Models (LLMs) with at least 100 billion parameters.
We introduce Dialogue-guided Chain-of-Thought (DialCoT), which employs a dialogue format to generate intermediate reasoning steps, guiding the model toward the final answer.
- Score: 18.96271708412086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) prompting has proven to be effective in enhancing the
reasoning capabilities of Large Language Models (LLMs) with at least 100
billion parameters. However, it is ineffective or even detrimental when applied
to reasoning tasks in Smaller Language Models (SLMs) with fewer than 10 billion
parameters. To address this limitation, we introduce Dialogue-guided
Chain-of-Thought (DialCoT), which employs a dialogue format to generate
intermediate reasoning steps, guiding the model toward the final answer.
Additionally, we optimize the model's reasoning path selection using the
Proximal Policy Optimization (PPO) algorithm, further enhancing its reasoning
capabilities. Our method offers several advantages compared to previous
approaches. Firstly, we transform the process of solving complex reasoning
questions by breaking them down into a series of simpler sub-questions,
significantly reducing the task difficulty and making it more suitable for
SLMs. Secondly, we optimize the model's reasoning path selection through the
PPO algorithm. We conduct comprehensive experiments on four arithmetic
reasoning datasets, demonstrating that our method achieves significant
performance improvements compared to state-of-the-art competitors.
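The page does not include code, so the following is only a rough sketch of the dialogue-format decomposition the abstract describes: a complex question is answered through a sequence of sub-question/sub-answer turns, and one of several candidate reasoning paths is selected. The `ask_model` and `decompose` callbacks and the toy path-scoring rule are assumptions for illustration; the paper itself learns path selection with PPO rather than the heuristic used here.

```python
# Minimal sketch of dialogue-guided decomposition (DialCoT-style), NOT the
# authors' implementation. `ask_model` is a hypothetical stand-in for an SLM
# call; the real method trains reasoning-path selection with PPO rather than
# the toy scoring rule used here.
from typing import Callable, List, Tuple

def dialcot_answer(question: str,
                   ask_model: Callable[[List[Tuple[str, str]], str], str],
                   decompose: Callable[[str], List[List[str]]]) -> str:
    """Solve `question` by answering sub-questions in a dialogue, then pick
    one finished reasoning path."""
    candidate_paths = decompose(question)          # several sub-question sequences
    finished = []
    for sub_questions in candidate_paths:
        dialogue: List[Tuple[str, str]] = []       # (sub_question, sub_answer) turns
        for sq in sub_questions:
            sub_answer = ask_model(dialogue, sq)   # model answers given prior turns
            dialogue.append((sq, sub_answer))
        final = ask_model(dialogue, "So what is the final answer to: " + question)
        finished.append((dialogue, final))
    # Toy path selection: prefer the path with the shortest, most direct final
    # answer. The paper instead learns this choice with PPO.
    _, best_final = min(finished, key=lambda pair: len(pair[1]))
    return best_final

if __name__ == "__main__":
    # Tiny hand-wired example so the sketch runs without a real model.
    def toy_decompose(q: str) -> List[List[str]]:
        return [["How many apples at the start?", "How many are eaten?"],
                ["How many remain after eating 2 of 5 apples?"]]

    def toy_model(dialogue, sub_question):
        canned = {"How many apples at the start?": "5",
                  "How many are eaten?": "2",
                  "How many remain after eating 2 of 5 apples?": "3"}
        return canned.get(sub_question, "3")

    print(dialcot_answer("If I have 5 apples and eat 2, how many are left?",
                         toy_model, toy_decompose))
```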
Related papers
- Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [56.37421741507468]
Chain-of-Thought (CoT) reasoning has significantly enhanced the performance of large language models (LLMs)
We propose a method to identify critical reasoning steps using perplexity as a measure of their importance.
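As a hedged sketch of how perplexity can flag critical reasoning steps, the snippet below scores each step by how much the answer's perplexity rises when that step is removed; the `answer_logprobs` stub stands in for a real language-model scorer and is an assumption, not the cited paper's code.

```python
# Sketch: rank reasoning steps by how much deleting them hurts the answer's
# perplexity. `answer_logprobs` is a hypothetical stand-in for an LM that
# returns per-token log-probabilities of the answer given a context.
import math
from typing import Callable, List

def step_importance(question: str, steps: List[str], answer: str,
                    answer_logprobs: Callable[[str, str], List[float]]) -> List[float]:
    def perplexity(context: str) -> float:
        lps = answer_logprobs(context, answer)
        return math.exp(-sum(lps) / len(lps))
    full_ppl = perplexity(question + " " + " ".join(steps))
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        ppl = perplexity(question + " " + " ".join(ablated))
        scores.append(ppl - full_ppl)   # large increase => the step was critical
    return scores

if __name__ == "__main__":
    # Toy scorer: the answer is likelier when the step containing "3" is present.
    def toy_logprobs(context: str, answer: str) -> List[float]:
        return [-0.1 if "3" in context else -1.0] * len(answer.split())
    print(step_importance("5 apples, eat 2.", ["5 - 2 = 3", "apples are red"], "3",
                          toy_logprobs))
```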
arXiv Detail & Related papers (2025-02-18T20:04:51Z)
- Evolutionary Pre-Prompt Optimization for Mathematical Reasoning [45.461506988071534]
This paper explores the optimization of example selection for designing effective chain-of-thought pre-prompts.
It shows that the choice of algorithm matters, with comparison-based methods such as evolutionary computation typically yielding significant gains in efficacy and feasibility.
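A minimal sketch of what comparison-based exemplar selection can look like, assuming a `fitness` callback (e.g. dev-set accuracy of the prompted model); the function and parameter names are illustrative, not the cited paper's implementation.

```python
# Sketch of comparison-based (1+1) evolutionary search over which worked
# examples go into a chain-of-thought pre-prompt. `fitness` (e.g. accuracy of
# the prompted model on a small dev set) is a hypothetical callback.
import random
from typing import Callable, List

def evolve_exemplars(pool_size: int, k: int,
                     fitness: Callable[[List[int]], float],
                     iters: int = 100, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    parent = rng.sample(range(pool_size), k)              # initial exemplar subset
    parent_fit = fitness(parent)
    for _ in range(iters):
        child = parent[:]
        new = rng.randrange(pool_size)
        if new not in child:
            child[rng.randrange(k)] = new                 # mutate one exemplar slot
        child_fit = fitness(child)
        if child_fit >= parent_fit:                       # comparison-based selection
            parent, parent_fit = child, child_fit
    return parent

if __name__ == "__main__":
    # Toy fitness that prefers exemplars with small indices, just to show the loop.
    print(evolve_exemplars(pool_size=50, k=4, fitness=lambda idx: -sum(idx)))
```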
arXiv Detail & Related papers (2024-12-05T16:12:06Z)
- A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
- PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking [0.0]
PRefLexOR combines preference optimization with concepts from Reinforcement Learning to enable models to self-teach.
We focus on applications in biological materials science and demonstrate the method in a variety of case studies.
arXiv Detail & Related papers (2024-10-16T08:46:26Z)
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [49.362750475706235]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks.
We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model.
Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
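To make the MDP framing concrete, here is a small hedged sketch of the soft (entropy-regularized) value and one-step target that SAC-style objectives build on, with each generated token treated as an action; this is generic soft Q-learning math and toy numbers, not the DQO authors' code.

```python
# Sketch of the soft state value used in SAC-style objectives:
#   V(s) = alpha * log sum_a exp(Q(s, a) / alpha)
# In the LM setting each action is the next token and Q(s, .) is parameterized
# by the language model's logits; the numbers below are only a toy example.
import math
from typing import List

def soft_value(q_values: List[float], alpha: float = 1.0) -> float:
    m = max(q / alpha for q in q_values)              # numerically stable log-sum-exp
    return alpha * (m + math.log(sum(math.exp(q / alpha - m) for q in q_values)))

def soft_bellman_target(reward: float, next_q_values: List[float],
                        gamma: float = 1.0, alpha: float = 1.0) -> float:
    """One-step target y = r + gamma * V_soft(s'); the Q-function (here, the LM)
    is regressed toward y."""
    return reward + gamma * soft_value(next_q_values, alpha)

if __name__ == "__main__":
    print(soft_bellman_target(reward=1.0, next_q_values=[0.2, -0.5, 1.3]))
```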
arXiv Detail & Related papers (2024-10-11T23:29:20Z)
- Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning [74.90592233107712]
We propose a Direct-Indirect Reasoning (DIR) method, which considers Direct Reasoning (DR) and Indirect Reasoning (IR) as multiple parallel reasoning paths that are merged to derive the final answer.
Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods.
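A minimal sketch of the "merge parallel paths" idea: final answers from direct and indirect reasoning paths are combined by majority vote. The path-generation callbacks are hypothetical stand-ins for prompted LLM calls, not the cited paper's code.

```python
# Sketch of merging Direct Reasoning (DR) and Indirect Reasoning (IR) paths by
# majority vote over their final answers. `direct_paths` and `indirect_paths`
# are hypothetical stand-ins for prompted LLM calls.
from collections import Counter
from typing import Callable, List

def dir_answer(question: str,
               direct_paths: Callable[[str], List[str]],
               indirect_paths: Callable[[str], List[str]]) -> str:
    answers = direct_paths(question) + indirect_paths(question)  # parallel paths
    return Counter(answers).most_common(1)[0][0]                 # merged final answer

if __name__ == "__main__":
    print(dir_answer("Is 7 prime?",
                     lambda q: ["yes", "yes"],    # direct paths
                     lambda q: ["yes", "no"]))    # indirect (contrapositive) paths
```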
arXiv Detail & Related papers (2024-02-06T03:41:12Z)
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation tasks.
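As an illustrative sketch (not the paper's code), the greedy decoding loop below expands one reasoning step at a time and keeps the candidate that the step-level reward model scores highest; `propose_steps` and `prm_score` are assumed stand-ins for the generator LLM and the PRM.

```python
# Sketch of PRM-guided greedy search: at each depth, sample candidate next
# reasoning steps and keep the one the step-level reward model rates highest.
# `propose_steps` (the LLM) and `prm_score` (the PRM) are hypothetical stubs.
from typing import Callable, List

def greedy_prm_search(question: str,
                      propose_steps: Callable[[str, List[str]], List[str]],
                      prm_score: Callable[[str, List[str]], float],
                      max_depth: int = 8) -> List[str]:
    path: List[str] = []
    for _ in range(max_depth):
        candidates = propose_steps(question, path)     # e.g. k sampled next steps
        if not candidates:
            break
        best = max(candidates, key=lambda s: prm_score(question, path + [s]))
        path.append(best)
        if best.startswith("Answer:"):                 # stop once an answer is emitted
            break
    return path
```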
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
- Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA [5.117094291273979]
Large Language Models (LLMs) have shown outstanding performance across a wide range of downstream tasks.
We propose Sci-CoT, a two-stage framework that separates the processes of generating rationales and inferring answers.
Our 80-million-parameter model is able to exceed the performance of BLOOM-176B on the ARC-Easy dataset under the few-shot setting.
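A small hedged sketch of the two-stage separation described above: stage one generates a rationale, stage two infers the answer conditioned on it. The `generate` callback and prompt wording are illustrative assumptions, not the Sci-CoT implementation.

```python
# Sketch of a two-stage pipeline in the spirit of Sci-CoT: stage 1 generates a
# rationale, stage 2 infers the answer conditioned on it. `generate` is a
# hypothetical stand-in for the (small) student model.
from typing import Callable

def two_stage_answer(question: str, generate: Callable[[str], str]) -> str:
    rationale = generate("Explain step by step: " + question)                    # stage 1
    return generate(f"Question: {question}\nRationale: {rationale}\nAnswer:")    # stage 2

if __name__ == "__main__":
    # Toy generator so the sketch runs end to end without a real model.
    print(two_stage_answer("What gas do plants absorb?",
                           lambda p: "CO2" if "Answer:" in p
                           else "Plants take in carbon dioxide."))
```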
arXiv Detail & Related papers (2023-08-09T03:18:07Z)
- Complexity-Based Prompting for Multi-Step Reasoning [72.0057198610614]
We study the task of prompting large-scale language models to perform multi-step reasoning.
A central question is which reasoning examples make the most effective prompts.
We propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning.
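A minimal sketch of the selection idea: among candidate annotated exemplars, prefer those with the most reasoning steps, approximating "complexity" by the number of newline-separated steps in each rationale. The field names are illustrative, not from the cited paper.

```python
# Sketch of complexity-based example selection: prefer CoT exemplars whose
# rationales contain the most newline-separated reasoning steps.
from typing import Dict, List

def select_complex_exemplars(exemplars: List[Dict[str, str]], k: int = 8) -> List[Dict[str, str]]:
    def n_steps(ex: Dict[str, str]) -> int:
        return len([line for line in ex["rationale"].splitlines() if line.strip()])
    return sorted(exemplars, key=n_steps, reverse=True)[:k]

if __name__ == "__main__":
    pool = [{"question": "q1", "rationale": "step 1\nstep 2\nstep 3"},
            {"question": "q2", "rationale": "step 1"},
            {"question": "q3", "rationale": "step 1\nstep 2"}]
    print([ex["question"] for ex in select_complex_exemplars(pool, k=2)])
```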
arXiv Detail & Related papers (2022-10-03T05:33:27Z)