2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
- URL: http://arxiv.org/abs/2410.19720v1
- Date: Fri, 25 Oct 2024 17:47:35 GMT
- Title: 2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
- Authors: Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, Jiaheng Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, Bo Zheng
- Abstract summary: We propose to extend the preference of DPO to two dimensions: segments and aspects.
We develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives.
- Score: 28.742104593747033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
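The abstract only sketches how the decomposed objective is formed. The snippet below is a minimal, hypothetical illustration of a segment- and aspect-weighted DPO-style loss, written against the description in the abstract rather than the paper's actual equations; the function name, argument shapes, score-aggregation scheme, and the `beta` default are all assumptions made for this sketch.

```python
# Illustrative sketch only: NOT the paper's exact 2D-DPO objective, but one minimal
# reading of "multi-segment, multi-aspect" supervision on top of the DPO loss.
import torch
import torch.nn.functional as F

def two_d_dpo_loss(seg_logratio_w, seg_logratio_l, seg_scores_w, seg_scores_l,
                   aspect_weights, beta=0.1):
    """Segment- and aspect-weighted DPO-style loss (hypothetical formulation).

    seg_logratio_w/l: [num_segments] log pi_theta(seg|x) - log pi_ref(seg|x)
                      for the chosen / rejected response.
    seg_scores_w/l:   [num_segments, num_aspects] annotated scores per segment
                      and aspect (e.g. from a HelpSteer-2D-style dataset).
    aspect_weights:   [num_aspects] relative importance of each aspect.
    """
    # Collapse the aspect dimension into one weight per segment.
    w_weights = (seg_scores_w * aspect_weights).sum(dim=-1)  # [S_w]
    l_weights = (seg_scores_l * aspect_weights).sum(dim=-1)  # [S_l]

    # Weight each segment's implicit reward by its aspect-aggregated score.
    reward_w = (w_weights * seg_logratio_w).sum()
    reward_l = (l_weights * seg_logratio_l).sum()

    # Standard Bradley-Terry / DPO comparison on the aggregated rewards.
    return -F.logsigmoid(beta * (reward_w - reward_l))

# Toy usage with random tensors, only to show the expected shapes.
if __name__ == "__main__":
    S, A = 4, 5  # segments (sentences) and aspects (quality rubrics)
    loss = two_d_dpo_loss(
        seg_logratio_w=torch.randn(S),
        seg_logratio_l=torch.randn(S),
        seg_scores_w=torch.rand(S, A),
        seg_scores_l=torch.rand(S, A),
        aspect_weights=torch.ones(A) / A,
    )
    print(loss.item())
```

In this reading, each segment's implicit reward (its policy-vs-reference log-ratio) is scaled by an aspect-aggregated score before the usual Bradley-Terry comparison between the chosen and rejected responses; the paper's actual decomposition may aggregate segments and aspects differently.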
Related papers
- seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation [2.4549463031236396]
3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes.
arXiv Detail & Related papers (2025-05-21T14:08:42Z) - Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm [16.66633426354087]
Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models with human preferences. We investigate the performance of DPO using open-source preference datasets. We propose an approach that incorporates segment-level score noise robustness into the 2D-DPO algorithm.
arXiv Detail & Related papers (2025-05-03T05:59:13Z) - DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution [24.460369372304807]
We introduce human preference alignment into Real-ISR, a technique that has been successfully applied in Large Language Models and Text-to-Image tasks.
We propose Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences by incorporating semantic guidance.
As a plug-and-play solution, DSPO proves highly effective in both one-step and multi-step SR frameworks.
arXiv Detail & Related papers (2025-04-21T15:35:48Z) - 2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization [3.674552982566341]
2D-Curri-DPO is a novel framework employing a two-dimensional curriculum that jointly models Prompt Complexity (PC) and Pairwise Distinguishability.
Our approach achieves state-of-the-art performance on challenging test sets like UltraFeedback.
arXiv Detail & Related papers (2025-04-10T15:32:00Z) - Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment [74.25832963097658]
Multi-Objective Alignment (MOA) aims to align responses with multiple human preference objectives.
We find that DPO-based MOA approaches suffer from widespread preference conflicts in the data.
arXiv Detail & Related papers (2025-02-20T08:27:00Z) - DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization [75.55167570591063]
We propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process.
DreamDPO reduces reliance on precise pointwise quality evaluations while enabling fine-grained controllability.
Experiments demonstrate that DreamDPO achieves competitive results, and provides higher-quality and more controllable 3D content.
arXiv Detail & Related papers (2025-02-05T11:03:08Z) - CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.
We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.
We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training.
The preferences for paired images are generated using a pre-trained reward function, eliminating the need to involve humans in the annotation process.
We introduce RankDPO to enhance DPO-based methods using the ranking feedback.
arXiv Detail & Related papers (2024-10-23T16:42:56Z) - Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation [19.2297264550686]
Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods.
We introduce Zero-Shot Dual-Path Integration Framework that equally values the contributions of both 3D and 2D modalities.
Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data.
arXiv Detail & Related papers (2024-08-16T07:52:00Z) - mDPO: Conditional Preference Optimization for Multimodal Large Language Models [52.607764280030196]
Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment.
Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement.
We propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference.
arXiv Detail & Related papers (2024-06-17T17:59:58Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Multi-Dimensional Optimization for Text Summarization via Reinforcement Learning [12.083649916114402]
We propose multi-objective reinforcement learning tailored to generate balanced summaries across all four dimensions.
Unlike prior ROUGE-based rewards relying on reference summaries, we use a QA-based reward model that aligns with human preferences.
Our approach achieved substantial performance gains compared to baseline models on representative summarization datasets.
arXiv Detail & Related papers (2024-06-01T05:15:12Z) - Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives [0.5120567378386615]
We propose a hybrid approach to aligning large language models (LLMs).
With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards.
The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives.
arXiv Detail & Related papers (2024-05-28T08:35:48Z) - Preference optimization of protein language models as a multi-objective binder design paradigm [0.0]
We present a multi-objective binder design paradigm based on instruction fine-tuning and direct preference optimization.
We show the proposed alignment strategy enables ProtGPT2 to effectively design binders conditioned on specified receptors and a drug developability criterion.
arXiv Detail & Related papers (2024-03-07T03:36:03Z) - PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction [72.75478398447396]
We propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively.
Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system.
We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane.
arXiv Detail & Related papers (2023-08-31T17:57:17Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION).
In the first stage, we strategically select different coarse 3D reconstruction candidates which are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z) - Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes the drawback of existing methods that seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.