Preference Optimization for Reasoning with Pseudo Feedback
- URL: http://arxiv.org/abs/2411.16345v1
- Date: Mon, 25 Nov 2024 12:44:02 GMT
- Title: Preference Optimization for Reasoning with Pseudo Feedback
- Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
- Abstract summary: We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
- Score: 100.62603571434167
- License:
- Abstract: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
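The abstract's core idea, scoring candidate solutions against test cases and taking the per-test majority output as pseudo ground truth (self-consistency extended to multiple test cases), can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function names, the agreement-fraction scoring, and the adjacent-rank pairing scheme are all assumptions for illustration.

```python
from collections import Counter

def run_solution(solution_fn, test_input):
    """Execute one candidate solution on one test input, guarding failures."""
    try:
        return solution_fn(test_input)
    except Exception:
        return None

def pseudo_label(solutions, test_inputs):
    """Score candidates by agreement with the per-test majority output.

    For each test input, the most common output across all candidates is
    treated as the pseudo ground truth; a candidate's score is the fraction
    of test cases on which it matches that consensus.
    """
    consensus = []
    for x in test_inputs:
        outputs = [run_solution(s, x) for s in solutions]
        majority, _ = Counter(outputs).most_common(1)[0]
        consensus.append(majority)
    scores = []
    for s in solutions:
        hits = sum(run_solution(s, x) == y for x, y in zip(test_inputs, consensus))
        scores.append(hits / len(test_inputs))
    return scores

def build_preference_pairs(solutions, scores):
    """Build DPO-style (chosen, rejected) pairs from adjacent ranks,
    skipping ties so each pair reflects a genuine score gap."""
    ranked = sorted(zip(scores, range(len(solutions))), reverse=True)
    pairs = []
    for (hi, i), (lo, j) in zip(ranked, ranked[1:]):
        if hi > lo:
            pairs.append((solutions[i], solutions[j]))
    return pairs
```

For example, given three sampled solutions of which two agree on every test input, the agreeing pair scores 1.0, the outlier scores 0.0, and one preference pair (agreeing vs. outlier) is produced for preference optimization.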
Related papers
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models.
We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories.
Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates [51.633266497799745]
Hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space.
We introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs; and (iii) a brand new inference scaling system.
arXiv Detail & Related papers (2025-02-10T18:51:47Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- PORT: Preference Optimization on Reasoning Traces [1.7292887546437081]
This paper proposes using preference optimization methods on Chain-of-Thought steps in order to improve the mathematical reasoning performances of language models.
Our approach leads to increased accuracy on the GSM8K and AQuA-RAT mathematical reasoning benchmarks for Falcon2-11B and Mistral-7B.
The improved abilities transfer to non-mathematical tasks, including the ARC benchmark and symbolic reasoning challenges.
arXiv Detail & Related papers (2024-06-23T09:51:06Z)
- Iterative Reasoning Preference Optimization [84.15992372132507]
We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates.
We show reasoning improves across repeated iterations of this scheme.
For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
arXiv Detail & Related papers (2024-04-30T17:28:05Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [33.5778998066089]
We introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl.
DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark.
arXiv Detail & Related papers (2024-02-05T18:55:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.