Related papers: CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization

URL: http://arxiv.org/abs/2502.18532v1
Date: Tue, 25 Feb 2025 03:07:02 GMT
Title: CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
Authors: Shuming Shi, Ruobing Zuo, Gaolei He, Jianlin Wang, Chenyang Xu, Zhengfeng Yang
Abstract summary: This paper introduces a Curriculum Learning-based DPO Iterative Theorem Proving (CuDIP) method. We propose a method for constructing preference data which utilizes LLMs and existing theorem proving data. We then integrate this preference data construction method with curriculum learning to iteratively fine-tune the theorem proving model.
Score: 22.935127114462475
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated theorem proving (ATP) is one of the most challenging mathematical reasoning tasks for Large Language Models (LLMs). Most existing LLM-based ATP methods rely on supervised fine-tuning, which results in a limited alignment between the theorem proving process and human preferences. Direct Preference Optimization (DPO), which aligns LLMs with human preferences, has shown positive effects for certain tasks. However, the lack of high-quality preference data for theorem proving presents a significant challenge. In this paper, we innovatively apply DPO to formal automated theorem proving and introduce a Curriculum Learning-based DPO Iterative Theorem Proving (CuDIP) method. Specifically, we propose a method for constructing preference data which utilizes LLMs and existing theorem proving data to enhance the diversity of the preference data while reducing the reliance on human preference annotations. We then integrate this preference data construction method with curriculum learning to iteratively fine-tune the theorem proving model through DPO. Experimental results on the MiniF2F and ProofNet datasets demonstrate the effectiveness of the proposed method.
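
Illustration: the abstract relies on Direct Preference Optimization but does not spell out the objective. The sketch below computes the standard per-pair DPO loss, -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x)))), for one preferred/rejected pair. It is a minimal sketch of generic DPO, not the CuDIP implementation; the function name, argument names, and the default beta=0.1 are assumptions made for illustration, and the log-probabilities are taken as given scalars.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    All *_logp arguments are sequence log-probabilities (floats).
    The names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # frozen reference model) favours the chosen output over the rejected one.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls below that value as the policy shifts probability toward the chosen output relative to the reference.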

Related papers

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models [0.7366405857677227]
This survey provides a theoretical unification of preference learning methods. We formalize each axis with precise definitions and theorems. We synthesize empirical findings across 50+ papers and provide a practitioner's decision guide for method selection.
arXiv Detail & Related papers (2026-01-03T08:33:26Z)
PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training [9.093854840532062]
PITA is a novel framework that integrates preference feedback directly into the LLM's token generation. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without fine-tuning. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification.
arXiv Detail & Related papers (2025-07-26T21:46:32Z)
Refine-POI: Reinforcement Fine-Tuned Large Language Models for Next Point-of-Interest Recommendation [51.08869388483333]
Large language models (LLMs) have been adopted for next point-of-interest (POI) recommendation tasks. We propose Refine-POI, a reinforcement fine-tuning framework for next POI recommendation.
arXiv Detail & Related papers (2025-06-19T02:51:10Z)
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
Debiasing Online Preference Learning via Preference Feature Preservation [64.55924745257951]
Recent preference learning frameworks simplify human preferences with binary pairwise comparisons and scalar rewards. This could make large language models' responses biased to mostly preferred features, and would be exacerbated during the iterations of online preference learning steps. We propose Preference Feature Preservation to maintain the distribution of human preference features and utilize such rich signals throughout the online preference learning process.
arXiv Detail & Related papers (2025-06-06T13:19:07Z)
MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning [0.0]
We propose the Multi-Granularity Direct Preference Optimization (MDPO) method, optimizing the mathematical reasoning of Large Language Models (LLMs). We conduct experiments on the open-source models Qwen2 and Llama3, achieving improvements of 1.7% and 1.2% on the GSM8K dataset, and 2.3% and 1.2% on the MATH dataset. We also provide a pipeline for constructing MDPO training data that is simple and does not require manual annotation costs.
arXiv Detail & Related papers (2025-05-30T08:42:14Z)
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment [94.36403843133616]
Using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions. We propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions.
arXiv Detail & Related papers (2025-05-25T17:42:52Z)
A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning [14.156753196673598]
This paper introduces a novel approach to produce high-quality reasoning traces for Large Language Models fine-tuning. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time.
arXiv Detail & Related papers (2024-10-29T17:50:31Z)
ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets. ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization [25.76847680704863]
Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and EXRM.
arXiv Detail & Related papers (2024-09-05T16:08:19Z)
Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO). Our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
arXiv Detail & Related papers (2024-06-16T09:06:17Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
Towards Optimal Learning of Language Models [124.65669486710992]
We present a theory for the optimal learning of language models (LMs). We derive a theorem, named Learning Law, to reveal the properties of the dynamics in the optimal learning process under our objective. We empirically verify that the optimal learning of LMs essentially stems from the improvement of the coefficients in the scaling law of LMs.
arXiv Detail & Related papers (2024-02-27T18:52:19Z)
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements. This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, and models. We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories. We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.