Related papers: The Hidden Link Between RLHF and Contrastive Learning

The Hidden Link Between RLHF and Contrastive Learning

URL: http://arxiv.org/abs/2506.22578v1
Date: Fri, 27 Jun 2025 18:51:25 GMT
Title: The Hidden Link Between RLHF and Contrastive Learning
Authors: Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen,
Abstract summary: We show that Reinforcement Learning from Human Feedback and Direct Preference Optimization can be interpreted from the perspective of mutual information.<n>Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning.<n>Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization.
Score: 24.828596020853727
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further explains why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks. We will release the model and code upon acceptance.

Related papers

IB-GRPO: Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization [20.87328464098245]
Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect.<n>LLMs offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging.<n>We propose IB-GRPO, an indicator-guided alignment approach for LLM-based LPR.
arXiv Detail & Related papers (2026-01-21T06:03:05Z)
Score-based Membership Inference on Diffusion Models [3.742113529511043]
Membership inference attacks (MIAs) against diffusion models have emerged as a pressing privacy concern.<n>We present a theoretical and empirical study of score-based MIAs, focusing on the predicted noise vectors that diffusion models learn to approximate.<n>We show that the expected denoiser output points toward a kernel-weighted local mean of nearby training samples, such that its norm encodes proximity to the training set and thereby reveals membership.
arXiv Detail & Related papers (2025-09-29T16:28:55Z)
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap.<n>We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications.
arXiv Detail & Related papers (2025-05-26T09:54:02Z)
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge [24.862965044243168]
Previous methods rely on single-point evaluations, overlooking the inherent diversity and uncertainty in human evaluations.<n>We propose a novel training framework that explicitly aligns the LLM-generated judgment distribution with empirical human distributions.<n>Our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods.
arXiv Detail & Related papers (2025-05-18T08:33:09Z)
A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.<n>Their alignment with human values remains critical for ensuring helpful and harmless deployments.<n>Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences.<n>It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance.<n>We propose textbfDecoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained emphglobal value model (GVM)
arXiv Detail & Related papers (2025-02-24T08:11:33Z)
OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models [68.17018458283651]
This work focuses on the offline evaluation of the chain-of-thought capabilities of LLMs. We use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. We show how to optimize LLMs based on the proposed evaluation method.
arXiv Detail & Related papers (2024-10-31T07:48:44Z)
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO) XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language.<n>As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values.<n>This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO)<n>By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.