Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
- URL: http://arxiv.org/abs/2404.10346v4
- Date: Thu, 03 Oct 2024 03:55:41 GMT
- Title: Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
- Authors: Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, Minjoon Seo
- Abstract summary: Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs).
We propose Self-Explore, where the LLM is tasked to explore the first wrong step within the rationale and use such signals as fine-grained rewards for further improvement.
On the GSM8K and MATH test sets, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT).
- Abstract: Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs). However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study the problem of whether LLMs could self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test set, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
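As a rough illustration of the "first pit" idea described in the abstract, the sketch below rolls out continuations from successive step prefixes of a rejected rationale, flags the earliest step after which no sampled continuation reaches the gold answer, and turns that step into a step-level preference pair. The helpers `generate` and `extract_answer` are hypothetical stand-ins for model calls; the authors' actual pipeline lives in the linked repository and may differ.
```python
# Hypothetical sketch of fine-grained ("first pit") reward construction.
from typing import Callable, List, Optional

def find_first_pit(
    question: str,
    steps: List[str],                            # rationale split into steps
    gold_answer: str,
    generate: Callable[[str, int], List[str]],   # (prompt, k) -> k sampled completions
    extract_answer: Callable[[str], str],        # completion -> final answer string
    k: int = 4,
) -> Optional[int]:
    """Index of the first step whose prefix can no longer be completed to the
    gold answer; None if every prefix is still recoverable."""
    for i in range(1, len(steps) + 1):
        prefix = question + "\n" + "\n".join(steps[:i])
        if not any(extract_answer(c) == gold_answer for c in generate(prefix, k)):
            return i - 1          # steps[i-1] is the first pit
    return None

def step_level_pair(question: str, steps: List[str], pit: int, recovered_step: str) -> dict:
    """Fine-grained preference pair: shared prefix up to the pit, the pit step
    rejected, and a continuation that still reaches the answer preferred."""
    prompt = question + "\n" + "\n".join(steps[:pit])
    return {"prompt": prompt, "chosen": recovered_step, "rejected": steps[pit]}
```
Pairs of this form could then be fed to a preference-learning objective such as DPO.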
Related papers
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
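Read loosely, the latent-reasoning formulation suggests a self-reward of the form "how likely is the correct answer given this sampled rationale". The toy sketch below scores sampled rationales that way; `sample_rationales` and `answer_logprob` are hypothetical stand-ins, not the LaTRO implementation.
```python
# Toy self-rewarding score for latent rationales (illustrative only).
def score_rationales(question, gold_answer, sample_rationales, answer_logprob, k=8):
    rationales = sample_rationales(question, k)            # z_1..z_k sampled rationales
    rewards = [answer_logprob(gold_answer, question, z)    # log p(correct answer | question, z)
               for z in rationales]
    # Higher-reward rationales would be up-weighted in the policy update.
    return list(zip(rationales, rewards))
```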
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Rational Metareasoning for Large Language Models [5.5539136805232205]
Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs).
This work introduces a novel approach based on computational models of metareasoning used in cognitive science.
We develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning.
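As a minimal illustration of a Value-of-Computation style reward, the function below rewards task success and charges a per-token cost for the reasoning trace; the coefficient is a made-up knob, not a value from the paper.
```python
# Illustrative reward that penalizes unnecessary reasoning (assumed form).
def metareasoning_reward(is_correct: bool, num_reasoning_tokens: int,
                         token_cost: float = 1e-3) -> float:
    task_reward = 1.0 if is_correct else 0.0
    return task_reward - token_cost * num_reasoning_tokens
```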
arXiv Detail & Related papers (2024-10-07T23:48:52Z)
- Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning [5.487210426671288]
In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training.
We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization.
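For reference, the standard DPO objective this line of work builds on can be written compactly as below; log-probabilities are assumed to be summed over response tokens, and this is the generic loss rather than the paper's exact training code.
```python
# Generic DPO loss: -log sigmoid(beta * ((chosen - ref_chosen) - (rejected - ref_rejected))).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```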
arXiv Detail & Related papers (2024-07-25T17:59:16Z)
- Training LLMs to Better Self-Debug and Explain Code [36.604898865514365]
We propose a training framework that significantly improves the self-debugging capability of LLMs.
We propose an automated pipeline to collect a high-quality dataset for code explanation and refinement.
We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design.
arXiv Detail & Related papers (2024-05-28T23:20:24Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
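Abstracting away the tree search itself, the step-level signal can be thought of as a rollout-based value estimate for each candidate step, with the best and worst candidates paired as preference data; `rollout` below is a hypothetical helper that completes a partial solution and returns its final answer.
```python
# Simplified look-ahead value for a candidate reasoning step (tree search omitted).
def step_value(prefix: str, candidate_step: str, gold_answer: str, rollout, n: int = 8) -> float:
    state = prefix + "\n" + candidate_step
    return sum(rollout(state) == gold_answer for _ in range(n)) / n

def step_preference_pair(prefix, candidates, gold_answer, rollout):
    ranked = sorted(candidates, key=lambda c: step_value(prefix, c, gold_answer, rollout))
    return {"prompt": prefix, "chosen": ranked[-1], "rejected": ranked[0]}
```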
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought [51.240387516059535]
We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., 1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks.
We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals.
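The decomposition can be sketched as two calls: the small model drafts a rationale and the large model answers conditioned on it. `small_lm` and `large_lm` are hypothetical text-in/text-out callables, and the prompts are illustrative.
```python
# Minimal guided-CoT pipeline: small model reasons, large model answers.
def lm_guided_answer(question: str, small_lm, large_lm) -> str:
    rationale = small_lm(f"Question: {question}\nThink step by step:")
    return large_lm(f"Question: {question}\nRationale: {rationale}\nFinal answer:")
```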
arXiv Detail & Related papers (2024-04-04T12:46:37Z)
- Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
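At the core of MCTS-style decision making is the UCT selection rule, which trades off a node's average value against how rarely it has been visited; the textbook form is shown below (not code from the paper).
```python
# UCB1/UCT score used to pick which child node to explore next.
import math

def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.41) -> float:
    if visits == 0:
        return float("inf")              # expand unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
```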
arXiv Detail & Related papers (2024-03-08T19:16:29Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Large Language Models Can Self-Improve [34.78624270280148]
We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions.
We show that our approach achieves state-of-the-art-level performance, without any ground truth label.
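The "high-confidence" filter can be approximated by self-consistency: sample several rationales, majority-vote the final answer, and keep only rationales that agree with the majority as new fine-tuning data. The sketch below assumes a hypothetical `extract_answer` helper.
```python
# Self-consistency filter for self-generated training data (illustrative).
from collections import Counter

def high_confidence_examples(question, samples, extract_answer, min_agreement=0.6):
    answers = [extract_answer(s) for s in samples]
    majority, count = Counter(answers).most_common(1)[0]
    if count / len(samples) < min_agreement:
        return []                        # not confident enough; skip this question
    return [(question, s) for s, a in zip(samples, answers) if a == majority]
```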
arXiv Detail & Related papers (2022-10-20T21:53:54Z)