Multi-Response Preference Optimization with Augmented Ranking Dataset
- URL: http://arxiv.org/abs/2412.07812v1
- Date: Tue, 10 Dec 2024 05:45:36 GMT
- Title: Multi-Response Preference Optimization with Augmented Ranking Dataset
- Authors: Hansle Gwon, Imjin Ahn, Young-Hak Kim, Sanghyun Park, Tae Joon Jun,
- Abstract summary: Preference optimization has played a significant role in improving the performance of Large Language Models.<n>We propose a novel approach to augment Preference Optimization datasets.<n>We also introduce a Multi-response-based Preference Optimization training method.
- Score: 2.8973004951877357
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have been remarkable, with new models consistently surpassing their predecessors. These advancements are underpinned by extensive research on various training mechanisms. Among these, Preference Optimization has played a significant role in improving the performance of LLMs by incorporating human preferences into the training process. However, constructing preference optimization datasets is challenging and the optimization process is highly sensitive to the dataset quality. In this study, we propose a novel approach to augment Preference Optimization datasets. Additionally, we introduce a Multi-response-based Preference Optimization training method that enables the simultaneous learning of multiple responses.
Related papers
- Preference-Guided Diffusion for Multi-Objective Offline Optimization [64.08326521234228]
We propose a preference-guided diffusion model for offline multi-objective optimization.
Our guidance is a preference model trained to predict the probability that one design dominates another.
Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions.
arXiv Detail & Related papers (2025-03-21T16:49:38Z) - A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.
Their alignment with human values remains critical for ensuring helpful and harmless deployments.
Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z) - Plug-and-Play Training Framework for Preference Optimization [25.53286104242179]
We propose a novel training framework for large language models (LLMs)
This framework employs multiple sampling to analyze output distributions, assign different weights to samples, and incorporate these weights into the preference optimization process.
Experimental results demonstrate that our framework integrates seamlessly with various preference optimization methods and achieves consistent improvements in mathematical reasoning tasks.
arXiv Detail & Related papers (2024-12-30T15:01:48Z) - Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different type of preference data on model performance.
It aims to reduce their dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z) - Preference Optimization with Multi-Sample Comparisons [53.02717574375549]
We introduce a novel approach that extends post-training to include multi-sample comparisons.
These approaches fail to capture critical characteristics such as generative diversity and bias.
We demonstrate that multi-sample comparison is more effective in optimizing collective characteristics than single-sample comparison.
arXiv Detail & Related papers (2024-10-16T00:59:19Z) - MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning [99.09906827676748]
We introduce MotionRL, the first approach to utilize Multi-Reward Reinforcement Learning (RL) for optimizing text-to-motion generation tasks.
Our novel approach uses reinforcement learning to fine-tune the motion generator based on human preferences prior knowledge of the human perception model.
In addition, MotionRL introduces a novel multi-objective optimization strategy to approximate optimality between text adherence, motion quality, and human preferences.
arXiv Detail & Related papers (2024-10-09T03:27:14Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - AIPO: Improving Training Objective for Iterative Preference Optimization [34.24211649396053]
We study iterative preference optimization with synthetic data.
We propose our training objective for iterative preference optimization, namely Agreement-aware Iterative Preference Optimization (AIPO)
arXiv Detail & Related papers (2024-09-13T14:03:49Z) - TSO: Self-Training with Scaled Preference Optimization [14.3799656174528]
We propose TSO, a framework for preference optimization that conducts self-training preference learning without training an additional reward model.
TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses.
Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks.
arXiv Detail & Related papers (2024-08-31T05:37:01Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO)
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - Unleashing the Potential of Large Language Models as Prompt Optimizers: An Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language models (LLMs)-based prompts.
We identify two pivotal factors in model parameter learning: update direction and update method.
In particular, we borrow the theoretical framework and learning methods from gradient-based optimization to design improved strategies.
arXiv Detail & Related papers (2024-02-27T15:05:32Z) - Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game [31.66896160733569]
We propose an Adversarial Preference Optimization (APO) framework to target more efficient human preference optimization.
We find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness.
arXiv Detail & Related papers (2023-11-14T10:10:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.