Related papers: Iterative Reasoning Preference Optimization

URL: http://arxiv.org/abs/2404.19733v3
Date: Wed, 26 Jun 2024 01:28:35 GMT
Title: Iterative Reasoning Preference Optimization
Authors: Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
Abstract summary: We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates. We show reasoning improves across repeated iterations of this scheme. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
Score: 84.15992372132507
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
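The abstract describes the training objective only as a modified DPO loss with an additional negative log-likelihood (NLL) term on the winning sequence. The following is a minimal sketch of one way such an objective could be combined for a single preference pair. The function name, the `alpha` weighting, the `beta` default, and the length normalization of the NLL term are all assumptions made for illustration and are not taken from the abstract.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_nll_loss(
    policy_logp_win: float,   # summed token log-prob of the winning CoT + answer under the policy
    policy_logp_lose: float,  # same for the losing CoT + answer
    ref_logp_win: float,      # winning sequence log-prob under the frozen reference model
    ref_logp_lose: float,     # losing sequence log-prob under the frozen reference model
    win_len: int,             # token count of the winning sequence (assumed normalizer)
    beta: float = 0.1,        # DPO temperature (assumed default)
    alpha: float = 1.0,       # weight on the NLL term (assumed default)
) -> float:
    """Hypothetical DPO + NLL objective for one (winner, loser) pair."""
    # Standard DPO margin: how much more the policy favors the winner
    # over the loser, relative to the reference model.
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    dpo_term = softplus(-margin)  # equals -log(sigmoid(margin))
    # Extra NLL term on the winning sequence, normalized by its length
    # (the normalization is an assumption of this sketch).
    nll_term = -policy_logp_win / win_len
    return dpo_term + alpha * nll_term


# When the policy equals the reference, the DPO term is log(2);
# only the NLL term depends on the winner's likelihood.
baseline = dpo_nll_loss(-10.0, -12.0, -10.0, -12.0, win_len=5)
improved = dpo_nll_loss(-8.0, -14.0, -10.0, -12.0, win_len=5)
```

In this sketch, raising the winner's likelihood lowers both terms, whereas the plain DPO term alone can also be reduced by pushing down the loser's likelihood without raising the winner's. That is one plausible reading of why an added NLL term might matter, not a claim drawn from the abstract.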

Related papers

Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase the output length due to overthinking. We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns. Our method achieves up to a 12% accuracy improvement and reduces token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z)
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO). ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
arXiv Detail & Related papers (2024-11-06T18:36:22Z)
PORT: Preference Optimization on Reasoning Traces [1.7292887546437081]
This paper proposes using preference optimization methods on Chain-of-Thought steps in order to improve the reasoning performances of language models. Our approach leads to increased accuracy on the GSM8K, AQuA-RAT, and ARC benchmarks for Falcon2-11B and Mistral-7B.
arXiv Detail & Related papers (2024-06-23T09:51:06Z)
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process. We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
Reward Model Ensembles Help Mitigate Overoptimization [7.715463015544845]
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As imperfect representations of the "true" reward, learned reward models are susceptible to overoptimization.
arXiv Detail & Related papers (2023-10-04T11:34:22Z)
Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance [53.49803579981569]
We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point. Existing methods such as SimCLR require a large batch size in order to achieve a satisfactory result. We propose a memory-efficient optimization algorithm for solving the Global Contrastive Learning of Representations, named SogCLR.
arXiv Detail & Related papers (2022-02-24T22:16:53Z)
Predict and Optimize: Through the Lens of Learning to Rank [9.434400627011108]
We show that noise contrastive estimation can be considered a case of learning to rank the solution cache. We also develop pairwise and listwise ranking loss functions, which can be differentiated in closed form without needing to solve the optimization problem.
arXiv Detail & Related papers (2021-12-07T10:11:44Z)
RSO: A Novel Reinforced Swarm Optimization Algorithm for Feature Selection [0.0]
In this paper, we propose a novel feature selection algorithm named Reinforced Swarm Optimization (RSO). This algorithm embeds the widely used Bee Swarm Optimization (BSO) algorithm along with Reinforcement Learning (RL) to maximize the reward of a superior search agent and punish the inferior ones. The proposed method is evaluated on 25 widely known UCI datasets containing a perfect blend of balanced and imbalanced data.
arXiv Detail & Related papers (2021-07-29T17:38:04Z)
Stochastic Optimization Forests [60.523606291705214]
We show how to train forest decision policies by growing trees that choose splits to directly optimize the downstream decision quality, rather than splitting to improve prediction accuracy as in the standard random forest algorithm. We show that our approximate splitting criteria can reduce running time hundredfold, while achieving performance close to forest algorithms that exactly re-optimize for every candidate split.
arXiv Detail & Related papers (2020-08-17T16:56:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.