Iterative Reasoning Preference Optimization
- URL: http://arxiv.org/abs/2404.19733v3
- Date: Wed, 26 Jun 2024 01:28:35 GMT
- Title: Iterative Reasoning Preference Optimization
- Authors: Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston,
- Abstract summary: We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates.
We show reasoning improves across repeated iterations of this scheme.
For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
- Score: 84.15992372132507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
Related papers
- Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO)
ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems.
On ZebraLogic, ScPO fine Llamatunes-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
arXiv Detail & Related papers (2024-11-06T18:36:22Z) - Influential Language Data Selection via Gradient Trajectory Pursuit [9.925547848971034]
Gradient Trajectory Pursuit (GTP) is an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective.
In the experiments, we demonstrate the algorithm in both in-domain and target-domain selection benchmarks.
arXiv Detail & Related papers (2024-10-22T05:32:40Z) - PORT: Preference Optimization on Reasoning Traces [1.7292887546437081]
This paper proposes using preference optimization methods on Chain-of-Thought steps in order to improve the reasoning performances of language models.
Our approach leads to increased accuracy on the GSM8K, AQuA-RAT, and ARC benchmarks for Falcon2-11B and Mistral-7B.
arXiv Detail & Related papers (2024-06-23T09:51:06Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Reward Model Ensembles Help Mitigate Overoptimization [7.715463015544845]
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions.
As imperfect representations of the "true" reward, learned reward models are susceptible to overoptimization.
arXiv Detail & Related papers (2023-10-04T11:34:22Z) - Provable Stochastic Optimization for Global Contrastive Learning: Small
Batch Does Not Harm Performance [53.49803579981569]
We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point.
Existing methods such as SimCLR requires a large batch size in order to achieve a satisfactory result.
We propose a memory-efficient optimization algorithm for solving the Global Contrastive Learning of Representations, named SogCLR.
arXiv Detail & Related papers (2022-02-24T22:16:53Z) - Predict and Optimize: Through the Lens of Learning to Rank [9.434400627011108]
We show the noise contrastive estimation can be considered a case of learning to rank the solution cache.
We also develop pairwise and listwise ranking loss functions, which can be differentiated in closed form without the need of solving the optimization problem.
arXiv Detail & Related papers (2021-12-07T10:11:44Z) - RSO: A Novel Reinforced Swarm Optimization Algorithm for Feature
Selection [0.0]
In this paper, we propose a novel feature selection algorithm named Reinforced Swarm Optimization (RSO)
This algorithm embeds the widely used Bee Swarm Optimization (BSO) algorithm along with Reinforcement Learning (RL) to maximize the reward of a superior search agent and punish the inferior ones.
The proposed method is evaluated on 25 widely known UCI datasets containing a perfect blend of balanced and imbalanced data.
arXiv Detail & Related papers (2021-07-29T17:38:04Z) - Stochastic Optimization Forests [60.523606291705214]
We show how to train forest decision policies by growing trees that choose splits to directly optimize the downstream decision quality, rather than splitting to improve prediction accuracy as in the standard random forest algorithm.
We show that our approximate splitting criteria can reduce running time hundredfold, while achieving performance close to forest algorithms that exactly re-optimize for every candidate split.
arXiv Detail & Related papers (2020-08-17T16:56:06Z) - ScopeFlow: Dynamic Scene Scoping for Optical Flow [94.42139459221784]
We propose to modify the common training protocols of optical flow.
The improvement is based on observing the bias in sampling challenging data.
We find that both regularization and augmentation should decrease during the training protocol.
arXiv Detail & Related papers (2020-02-25T09:58:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.