Related papers: Optimizing Language Models for Human Preferences is a Causal Inference Problem

Optimizing Language Models for Human Preferences is a Causal Inference Problem

URL: http://arxiv.org/abs/2402.14979v2
Date: Wed, 5 Jun 2024 23:19:19 GMT
Title: Optimizing Language Models for Human Preferences is a Causal Inference Problem
Authors: Victoria Lin, Eli Ben-Michael, Louis-Philippe Morency,
Abstract summary: We present an initial exploration of language model optimization for human preferences from direct outcome datasets. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We extend CPO with doubly robust CPO, which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias.
Score: 41.59906798328058
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.

Related papers

Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models [14.937360204897415]
A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability.<n>We propose a context-augmented data augmentation method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus.<n>We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs.
arXiv Detail & Related papers (2026-02-10T09:45:21Z)
Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification [2.503562746177713]
We develop a framework that combines fine-tuning and rectification, and optimally allocates limited labeled samples across the two stages.<n>Building on this insight, we leverage empirical scaling laws to develop a data-driven method for optimally splitting samples between the fine-tuning and rectification stages.<n> Empirical analysis validates our framework, demonstrating improved estimation and inference performance compared to using either fine-tuning or rectification alone.
arXiv Detail & Related papers (2025-11-23T05:23:21Z)
InfoPO: On Mutual Information Maximization for Large Language Model Alignment [26.692916936162824]
We study the post-training of large language models with human preference data.<n>We propose a principled preference fine-tuning algorithm called InfoPO.
arXiv Detail & Related papers (2025-05-13T12:37:48Z)
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts [52.983390470606146]
Preference alignment methods are increasingly critical for steering large language models to generate outputs consistent with human values.<n>We propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts.
arXiv Detail & Related papers (2025-04-08T09:14:38Z)
CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation [20.959453238159863]
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging. Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. We propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning.
arXiv Detail & Related papers (2025-01-23T18:59:47Z)
GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets [19.485572131953937]
We propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting. Empirical results show GDPO can generate far more diverse responses than the baseline methods.
arXiv Detail & Related papers (2024-10-19T13:07:52Z)
Model-based Preference Optimization in Abstractive Summarization without Human Feedback [5.438770095369458]
We introduce Model-based Preference Optimization (MPO) to fine-tune Large Language Models for improved summarization abilities without any human feedback. Our experiments on standard summarization datasets and various metrics demonstrate that our proposed MPO significantly enhances the quality of generated summaries without relying on human feedback.
arXiv Detail & Related papers (2024-09-27T10:35:45Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization [0.0]
Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language.
arXiv Detail & Related papers (2024-07-18T22:32:20Z)
Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences. Our key idea is leveraging the human prior knowledge within the small (seed) data. We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
arXiv Detail & Related papers (2024-02-12T23:09:00Z)
Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice. We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling. We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences. We show that it consistently outperforms PPO in language tasks by a large margin. We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.