Preference Optimization for Reasoning with Pseudo Feedback
- URL: http://arxiv.org/abs/2411.16345v1
- Date: Mon, 25 Nov 2024 12:44:02 GMT
- Title: Preference Optimization for Reasoning with Pseudo Feedback
- Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
- Abstract summary: We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
- Score: 100.62603571434167
- License:
- Abstract: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
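A minimal sketch of the general idea, not the paper's implementation: sampled solutions are scored by executing them against (possibly pseudo) test cases, and the resulting pass rates are turned into (chosen, rejected) pairs for DPO-style training. The helpers `run_solution` and `build_preference_pairs`, and the `margin` threshold, are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): label sampled solutions by executing
# them against pseudo test cases, then build (chosen, rejected) pairs for DPO.
from itertools import product
from typing import Callable, List, Tuple

def pass_rate(solution: str, test_cases: List[Tuple[str, str]],
              run_solution: Callable[[str, str], str]) -> float:
    """Fraction of test cases whose expected output matches the solution's output.
    `run_solution` (assumed helper) executes a candidate solution on one input."""
    if not test_cases:
        return 0.0
    passed = sum(run_solution(solution, inp) == expected
                 for inp, expected in test_cases)
    return passed / len(test_cases)

def build_preference_pairs(solutions: List[str],
                           test_cases: List[Tuple[str, str]],
                           run_solution: Callable[[str, str], str],
                           margin: float = 0.5) -> List[Tuple[str, str]]:
    """Pair solutions whose pseudo-feedback scores differ by at least `margin`;
    the higher-scoring one becomes the 'chosen' response."""
    scored = [(s, pass_rate(s, test_cases, run_solution)) for s in solutions]
    pairs = []
    for (s_a, r_a), (s_b, r_b) in product(scored, repeat=2):
        if r_a - r_b >= margin:
            pairs.append((s_a, s_b))  # (chosen, rejected)
    return pairs
```

In this framing, the paper's two pseudo-feedback variants differ only in where `test_cases` come from: generated by a frontier LLM, or derived by extending self-consistency to multiple test cases.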
Related papers
- Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO)
ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems.
On ZebraLogic, ScPO fine-tunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
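A hedged sketch of the general self-consistency idea (not ScPO's exact recipe): treat the majority final answer across sampled solutions as "consistent" and prefer it over minority answers. The `extract_answer` helper is an assumption.

```python
# Illustrative self-consistency-based preference pairs (assumptions, not the
# ScPO implementation): solutions reaching the majority answer are "chosen",
# solutions reaching any other answer are "rejected".
from collections import Counter
from typing import Callable, List, Tuple

def consistency_pairs(solutions: List[str],
                      extract_answer: Callable[[str], str]) -> List[Tuple[str, str]]:
    if not solutions:
        return []
    answers = [extract_answer(s) for s in solutions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    chosen = [s for s, a in zip(solutions, answers) if a == majority_answer]
    rejected = [s for s, a in zip(solutions, answers) if a != majority_answer]
    # Every (consistent, inconsistent) combination becomes a preference pair.
    return [(c, r) for c in chosen for r in rejected]
```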
arXiv Detail & Related papers (2024-11-06T18:36:22Z)
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [24.386388107656334]
We propose PROVE, a framework that uses program-based verification to filter out potentially incorrect reasoning paths.
Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution.
PROVE consistently outperforms vanilla majority voting on mathematical reasoning tasks across all datasets and model sizes.
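A rough sketch of that filter-then-vote idea (the `solution_answer` and `program_output` helpers are assumptions, not PROVE's actual interface):

```python
# Sketch of program-as-verifier filtering before majority voting (illustrative,
# not the PROVE codebase): drop solutions whose accompanying program disagrees
# with the answer stated in the natural-language solution, then vote.
from collections import Counter
from typing import List, Optional, Tuple

def filtered_majority_vote(candidates: List[Tuple[str, str]],
                           solution_answer, program_output) -> Optional[str]:
    """`candidates` holds (solution_text, program_text) pairs.
    `solution_answer` / `program_output` are assumed helpers returning the
    final answer stated in the text and the executed program's result."""
    verified = []
    for solution, program in candidates:
        answer = solution_answer(solution)
        if answer == program_output(program):
            verified.append(answer)
    if not verified:
        return None  # in practice one might fall back to plain majority voting
    return Counter(verified).most_common(1)[0][0]
```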
arXiv Detail & Related papers (2024-10-16T14:24:55Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Advancing Process Verification for Large Language Models via Tree-Based Preference Learning [23.63889344974957]
Tree-based Preference Learning Verifier (Tree-PLV) is a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training.
We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks.
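A minimal best-first-search skeleton in that spirit; the step proposer, value function, and pair-collection rule here are assumptions, not Tree-PLV's implementation.

```python
# Best-first construction of a reasoning tree (illustrative skeleton, not
# Tree-PLV itself): repeatedly expand the highest-value partial solution and
# record value-ranked sibling steps as step-level preference pairs.
import heapq
from typing import Callable, List

def best_first_tree(question: str,
                    propose_steps: Callable[[str, List[str]], List[str]],  # assumed
                    value: Callable[[str, List[str]], float],              # assumed
                    max_expansions: int = 50):
    frontier = [(-value(question, []), [])]  # max-heap via negated scores
    step_pairs = []                          # (better_prefix, worse_prefix)
    for _ in range(max_expansions):
        if not frontier:
            break
        _, prefix = heapq.heappop(frontier)
        children = propose_steps(question, prefix)
        scored = sorted(((value(question, prefix + [c]), c) for c in children),
                        reverse=True)
        # Sibling steps ranked by value yield step-level preference data.
        for (v_hi, s_hi), (v_lo, s_lo) in zip(scored, scored[1:]):
            if v_hi > v_lo:
                step_pairs.append((prefix + [s_hi], prefix + [s_lo]))
        for v, step in scored:
            heapq.heappush(frontier, (-v, prefix + [step]))
    return step_pairs
```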
arXiv Detail & Related papers (2024-06-29T10:09:49Z)
- Iterative Reasoning Preference Optimization [84.15992372132507]
We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates.
We show reasoning improves across repeated iterations of this scheme.
For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
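A high-level sketch of such an iterative loop (the sampler, answer parser, and trainer interfaces are assumptions, not the paper's code):

```python
# Schematic iterative preference-optimization loop (illustrative only):
# sample CoT candidates, pair correct against incorrect ones by final answer,
# run one round of preference training, and repeat with the updated model.
from typing import Callable, List, Tuple

def iterative_cot_dpo(model,
                      problems: List[Tuple[str, str]],        # (question, gold_answer)
                      sample_cots: Callable[..., List[str]],  # assumed sampler
                      extract_answer: Callable[[str], str],   # assumed parser
                      dpo_step: Callable[..., object],        # assumed trainer
                      iterations: int = 3, k: int = 8):
    for _ in range(iterations):
        pairs = []
        for question, gold in problems:
            cots = sample_cots(model, question, k)
            correct = [c for c in cots if extract_answer(c) == gold]
            wrong = [c for c in cots if extract_answer(c) != gold]
            pairs.extend((question, c, w) for c in correct for w in wrong)
        if pairs:
            model = dpo_step(model, pairs)  # preference update for this iteration
    return model
```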
arXiv Detail & Related papers (2024-04-30T17:28:05Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
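A simple way to picture the masking step; this is a generic sketch, and the mask rate and mask token are assumptions, not the paper's exact configuration.

```python
# Illustrative random masking of chain-of-thought tokens during training
# (generic sketch; mask rate and mask token id are assumptions).
import random
from typing import List, Optional

def mask_reasoning_tokens(cot_token_ids: List[int],
                          mask_token_id: int,
                          mask_rate: float = 0.1,
                          seed: Optional[int] = None) -> List[int]:
    rng = random.Random(seed)
    return [mask_token_id if rng.random() < mask_rate else tok
            for tok in cot_token_ids]
```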
arXiv Detail & Related papers (2024-03-04T16:21:54Z)
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [33.5778998066089]
We introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl.
DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark.
arXiv Detail & Related papers (2024-02-05T18:55:32Z)