Preference Optimization for Reasoning with Pseudo Feedback
- URL: http://arxiv.org/abs/2411.16345v1
- Date: Mon, 25 Nov 2024 12:44:02 GMT
- Title: Preference Optimization for Reasoning with Pseudo Feedback
- Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei
- Abstract summary: We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
- Score: 100.62603571434167
- License:
- Abstract: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
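A minimal sketch of the general idea, not the paper's implementation: sampled solutions are scored by executing them against (possibly pseudo) test cases, and the resulting pass rates are turned into (chosen, rejected) pairs for DPO-style training. The helpers `run_solution` and `build_preference_pairs`, and the `margin` threshold, are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): label sampled solutions by executing
# them against pseudo test cases, then build (chosen, rejected) pairs for DPO.
from itertools import product
from typing import Callable, List, Tuple

def pass_rate(solution: str, test_cases: List[Tuple[str, str]],
              run_solution: Callable[[str, str], str]) -> float:
    """Fraction of test cases whose expected output matches the solution's output.
    `run_solution` (assumed helper) executes a candidate solution on one input."""
    if not test_cases:
        return 0.0
    passed = sum(run_solution(solution, inp) == expected
                 for inp, expected in test_cases)
    return passed / len(test_cases)

def build_preference_pairs(solutions: List[str],
                           test_cases: List[Tuple[str, str]],
                           run_solution: Callable[[str, str], str],
                           margin: float = 0.5) -> List[Tuple[str, str]]:
    """Pair solutions whose pseudo-feedback scores differ by at least `margin`;
    the higher-scoring one becomes the 'chosen' response."""
    scored = [(s, pass_rate(s, test_cases, run_solution)) for s in solutions]
    pairs = []
    for (s_a, r_a), (s_b, r_b) in product(scored, repeat=2):
        if r_a - r_b >= margin:
            pairs.append((s_a, s_b))  # (chosen, rejected)
    return pairs
```

In this framing, the paper's two pseudo-feedback variants differ only in where `test_cases` come from: generated by a frontier LLM, or derived by extending self-consistency to multiple test cases.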
Related papers
- Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO)
ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems.
On ZebraLogic, ScPO fine-tunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
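A hedged sketch of the general self-consistency idea (not ScPO's exact recipe): treat the majority final answer across sampled solutions as "consistent" and prefer it over minority answers. The `extract_answer` helper is an assumption.

```python
# Illustrative self-consistency-based preference pairs (assumptions, not the
# ScPO implementation): solutions reaching the majority answer are "chosen",
# solutions reaching any other answer are "rejected".
from collections import Counter
from typing import Callable, List, Tuple

def consistency_pairs(solutions: List[str],
                      extract_answer: Callable[[str], str]) -> List[Tuple[str, str]]:
    if not solutions:
        return []
    answers = [extract_answer(s) for s in solutions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    chosen = [s for s, a in zip(solutions, answers) if a == majority_answer]
    rejected = [s for s, a in zip(solutions, answers) if a != majority_answer]
    # Every (consistent, inconsistent) combination becomes a preference pair.
    return [(c, r) for c in chosen for r in rejected]
```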
arXiv Detail & Related papers (2024-11-06T18:36:22Z)
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [24.386388107656334]
We propose PROVE, a framework that uses program-based verification to filter out potentially incorrect reasoning paths.
Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution.
PROVE consistently outperforms vanilla majority voting on mathematical reasoning tasks across all datasets and model sizes.
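A rough sketch of that filter-then-vote idea (the `solution_answer` and `program_output` helpers are assumptions, not PROVE's actual interface):

```python
# Sketch of program-as-verifier filtering before majority voting (illustrative,
# not the PROVE codebase): drop solutions whose accompanying program disagrees
# with the answer stated in the natural-language solution, then vote.
from collections import Counter
from typing import List, Optional, Tuple

def filtered_majority_vote(candidates: List[Tuple[str, str]],
                           solution_answer, program_output) -> Optional[str]:
    """`candidates` holds (solution_text, program_text) pairs.
    `solution_answer` / `program_output` are assumed helpers returning the
    final answer stated in the text and the executed program's result."""
    verified = []
    for solution, program in candidates:
        answer = solution_answer(solution)
        if answer == program_output(program):
            verified.append(answer)
    if not verified:
        return None  # in practice one might fall back to plain majority voting
    return Counter(verified).most_common(1)[0][0]
```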
arXiv Detail & Related papers (2024-10-16T14:24:55Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Advancing Process Verification for Large Language Models via Tree-Based Preference Learning [23.63889344974957]
Tree-based Preference Learning Verifier (Tree-PLV) is a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training.
We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks.
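A minimal best-first-search skeleton in that spirit; the step proposer, value function, and pair-collection rule here are assumptions, not Tree-PLV's implementation.

```python
# Best-first construction of a reasoning tree (illustrative skeleton, not
# Tree-PLV itself): repeatedly expand the highest-value partial solution and
# record value-ranked sibling steps as step-level preference pairs.
import heapq
from typing import Callable, List

def best_first_tree(question: str,
                    propose_steps: Callable[[str, List[str]], List[str]],  # assumed
                    value: Callable[[str, List[str]], float],              # assumed
                    max_expansions: int = 50):
    frontier = [(-value(question, []), [])]  # max-heap via negated scores
    step_pairs = []                          # (better_prefix, worse_prefix)
    for _ in range(max_expansions):
        if not frontier:
            break
        _, prefix = heapq.heappop(frontier)
        children = propose_steps(question, prefix)
        scored = sorted(((value(question, prefix + [c]), c) for c in children),
                        reverse=True)
        # Sibling steps ranked by value yield step-level preference data.
        for (v_hi, s_hi), (v_lo, s_lo) in zip(scored, scored[1:]):
            if v_hi > v_lo:
                step_pairs.append((prefix + [s_hi], prefix + [s_lo]))
        for v, step in scored:
            heapq.heappush(frontier, (-v, prefix + [step]))
    return step_pairs
```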
arXiv Detail & Related papers (2024-06-29T10:09:49Z)
- Iterative Reasoning Preference Optimization [84.15992372132507]
We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates.
We show reasoning improves across repeated iterations of this scheme.
For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
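A high-level sketch of such an iterative loop (the sampler, answer parser, and trainer interfaces are assumptions, not the paper's code):

```python
# Schematic iterative preference-optimization loop (illustrative only):
# sample CoT candidates, pair correct against incorrect ones by final answer,
# run one round of preference training, and repeat with the updated model.
from typing import Callable, List, Tuple

def iterative_cot_dpo(model,
                      problems: List[Tuple[str, str]],        # (question, gold_answer)
                      sample_cots: Callable[..., List[str]],  # assumed sampler
                      extract_answer: Callable[[str], str],   # assumed parser
                      dpo_step: Callable[..., object],        # assumed trainer
                      iterations: int = 3, k: int = 8):
    for _ in range(iterations):
        pairs = []
        for question, gold in problems:
            cots = sample_cots(model, question, k)
            correct = [c for c in cots if extract_answer(c) == gold]
            wrong = [c for c in cots if extract_answer(c) != gold]
            pairs.extend((question, c, w) for c in correct for w in wrong)
        if pairs:
            model = dpo_step(model, pairs)  # preference update for this iteration
    return model
```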
arXiv Detail & Related papers (2024-04-30T17:28:05Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
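A simple way to picture the masking step; this is a generic sketch, and the mask rate and mask token are assumptions, not the paper's exact configuration.

```python
# Illustrative random masking of chain-of-thought tokens during training
# (generic sketch; mask rate and mask token id are assumptions).
import random
from typing import List, Optional

def mask_reasoning_tokens(cot_token_ids: List[int],
                          mask_token_id: int,
                          mask_rate: float = 0.1,
                          seed: Optional[int] = None) -> List[int]:
    rng = random.Random(seed)
    return [mask_token_id if rng.random() < mask_rate else tok
            for tok in cot_token_ids]
```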
arXiv Detail & Related papers (2024-03-04T16:21:54Z)
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [33.5778998066089]
We introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl.
DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark.
arXiv Detail & Related papers (2024-02-05T18:55:32Z)