Learning Loss Landscapes in Preference Optimization
- URL: http://arxiv.org/abs/2411.06568v1
- Date: Sun, 10 Nov 2024 19:11:48 GMT
- Title: Learning Loss Landscapes in Preference Optimization
- Authors: Carlo Alfano, Silvia Sapora, Jakob Nicolaus Foerster, Patrick Rebeschini, Yee Whye Teh,
- Abstract summary: We present an empirical study investigating how specific properties of preference datasets, such as mixed-quality or noisy data, affect the performance of Preference Optimization (PO) algorithms.
Our experiments, conducted in MuJoCo environments, reveal several scenarios where state-of-the-art PO methods experience significant drops in performance.
Within this framework, we employ evolutionary strategies to discover new loss functions capable of handling the identified problematic scenarios.
- Score: 39.15940594751445
- License:
- Abstract: We present an empirical study investigating how specific properties of preference datasets, such as mixed-quality or noisy data, affect the performance of Preference Optimization (PO) algorithms. Our experiments, conducted in MuJoCo environments, reveal several scenarios where state-of-the-art PO methods experience significant drops in performance. To address this issue, we introduce a novel PO framework based on mirror descent, which can recover existing methods like Direct Preference Optimization (DPO) and Odds-Ratio Preference Optimization (ORPO) for specific choices of the mirror map. Within this framework, we employ evolutionary strategies to discover new loss functions capable of handling the identified problematic scenarios. These new loss functions lead to significant performance improvements over DPO and ORPO across several tasks. Additionally, we demonstrate the generalization capability of our approach by applying the discovered loss functions to fine-tuning large language models using mixed-quality data, where they outperform ORPO.
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Revisiting BPR: A Replicability Study of a Common Recommender System Baseline [78.00363373925758]
We study the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations.
Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations.
We show that the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets.
arXiv Detail & Related papers (2024-09-21T18:39:53Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms [13.134564730161983]
This paper adopts a novel approach to deep learning optimization, focusing on gradient descent (SGD) and its variants.
We show that SGD and its variants demonstrate performance on par with flat-minimas like SAM, albeit with half the gradient evaluations.
Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
arXiv Detail & Related papers (2024-03-01T14:55:22Z) - Explainable Landscape-Aware Optimization Performance Prediction [0.0]
We are investigating explainable landscape-aware regression models.
The contribution of each landscape feature to the prediction of the optimization algorithm performance is estimated on a global and local level.
The results show a proof of concept that different set of features are important for different problem instances.
arXiv Detail & Related papers (2021-10-22T07:46:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.