Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks
- URL: http://arxiv.org/abs/2505.12845v1
- Date: Mon, 19 May 2025 08:33:11 GMT
- Title: Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks
- Authors: Ruopei Sun, Jianfeng Cai, Jinhua Zhu, Kangwen Zhao, Dongyun Xue, Wengang Zhou, Li Li, Houqiang Li
- Abstract summary: RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences. However, RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. We propose a novel Multi-level Aware Preference Learning (MAPL) framework capable of enhancing multi-instruction capabilities.
- Score: 81.44256822500257
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. Conventional approaches rely heavily on human annotation or more sophisticated large language models, thereby introducing substantial resource expenditure or potential bias concerns. Meanwhile, alternative synthetic methods that augment standard preference datasets often compromise the model's semantic quality. Our research identifies a critical oversight in existing techniques, which predominantly focus on comparing responses while neglecting valuable latent signals embedded within prompt inputs, and which only focus on preference disparities at the intra-sample level, while neglecting to account for the inter-sample level preference differentials that exist among preference data. To leverage these previously neglected indicators, we propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities. Specifically, for any given response in original preference data pairs, we construct varied prompts with a preference relation under different conditions, in order to learn intra-sample level preference disparities. Furthermore, for any given original preference pair, we synthesize multi-instruction preference pairs to capture preference discrepancies at the inter-sample level. Building on the two datasets constructed above, we consequently devise two sophisticated training objective functions. Subsequently, our framework integrates seamlessly into both Reward Modeling and Direct Preference Optimization paradigms. Through rigorous evaluation across multiple benchmarks, we empirically validate the efficacy of our framework.
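The abstract does not spell out MAPL's two objective functions, only that the constructed intra-sample pairs (the same response under a preferred versus a dispreferred prompt) and inter-sample multi-instruction pairs feed into Reward Modeling and DPO-style training. As a point of reference, the sketch below shows the vanilla DPO loss that such pairs would plug into; the function name, the beta value, and the random log-probabilities are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    preferred / dispreferred sequence under the policy or the frozen reference
    model. MAPL-style intra-sample pairs (same response, different prompts) and
    inter-sample multi-instruction pairs could both be scored this way; the
    paper's actual objectives add further terms not shown here.
    """
    chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward of the preferred item
    rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward of the dispreferred item
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random numbers standing in for real model log-probabilities.
torch.manual_seed(0)
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```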
Related papers
- Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling [0.0]
This paper introduces a novel framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preferences in large language models (LLMs).
Our approach captures not only the degree of preference but also the uncertainty and hesitation inherent in human judgment through membership, non-membership, and hesitation degrees.
Experimental validation on multiple datasets demonstrates that our IFS-based approach significantly improves annotation consistency, reduces annotator fatigue, and produces higher-quality preference data.
arXiv Detail & Related papers (2025-05-30T04:20:00Z)
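In intuitionistic fuzzy sets, the three degrees mentioned in the entry above are tied together by mu + nu <= 1 with hesitation pi = 1 - mu - nu. The snippet below is only a minimal illustration of labels carrying those degrees and a naive averaging aggregation; the paper's actual aggregation operator is not given in the summary and is assumed here.

```python
from dataclasses import dataclass

@dataclass
class IFSLabel:
    """One annotator's side-by-side judgment that response A beats response B."""
    membership: float      # degree of preference for A (mu)
    non_membership: float  # degree of preference against A (nu), with mu + nu <= 1

    @property
    def hesitation(self) -> float:
        # Hesitation degree pi = 1 - mu - nu: the annotator's residual uncertainty.
        return 1.0 - self.membership - self.non_membership

def aggregate(labels: list[IFSLabel]) -> IFSLabel:
    """Illustrative aggregation: average mu and nu across annotators."""
    n = len(labels)
    return IFSLabel(
        membership=sum(l.membership for l in labels) / n,
        non_membership=sum(l.non_membership for l in labels) / n,
    )

votes = [IFSLabel(0.7, 0.1), IFSLabel(0.5, 0.3), IFSLabel(0.6, 0.2)]
agg = aggregate(votes)
print(agg, "hesitation:", round(agg.hesitation, 3))
```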
- Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes [54.93980123979578]
We introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors, as well as their combinations, behind holistic preferences.
LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data.
arXiv Detail & Related papers (2025-05-08T06:59:06Z)
- Like Father, Like Son: Kinship-Aware Preference Mapping (KARMA) for Automatic Alignment in Large Language Models [2.970904425631548]
Kinship-Aware pReference MApping (KARMA) is a novel framework that pairs responses from models with comparable competencies.
By constraining preference comparisons to outputs of similar complexity and quality, KARMA enhances the informativeness of preference data.
Empirical evaluations demonstrate that our kinship-aware approach leads to more consistent and interpretable alignment outcomes.
arXiv Detail & Related papers (2025-02-26T01:36:40Z)
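The KARMA entry above constrains preference comparisons to responses of similar complexity and quality; one minimal way to express that constraint is a quality-gap filter, sketched below. The per-response quality scores and the 0.15 threshold are assumptions for illustration, not the paper's recipe.

```python
from itertools import combinations

def kinship_pairs(responses, max_gap=0.15):
    """Keep only candidate pairs whose quality scores lie within max_gap.

    `responses` is a list of (text, quality_score) tuples; the scoring model
    and the threshold are illustrative assumptions.
    """
    pairs = []
    for (text_a, q_a), (text_b, q_b) in combinations(responses, 2):
        if abs(q_a - q_b) <= max_gap:
            chosen, rejected = (text_a, text_b) if q_a >= q_b else (text_b, text_a)
            pairs.append((chosen, rejected))
    return pairs

candidates = [("answer from strong model", 0.82),
              ("answer from peer model", 0.74),
              ("answer from weak model", 0.31)]
print(kinship_pairs(candidates))  # only the 0.82 vs. 0.74 pair survives
```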
- A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts.
With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS).
Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.
High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
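Of the two curation methods named in the entry above, rejection sampling is the simpler one: draw several responses per prompt, score them, and pair the best against the worst. The sketch below uses a random placeholder scorer and generator; the paper's generator, verifier, and MCTS variant are not reproduced here.

```python
import random

def score(prompt: str, response: str) -> float:
    """Placeholder scorer; a real pipeline would call a reward model or verifier."""
    return random.random()

def rs_preference_pair(prompt: str, sample_response, n: int = 8):
    """Rejection-sampling curation: draw n responses, keep best vs. worst."""
    responses = [sample_response(prompt) for _ in range(n)]
    ranked = sorted(responses, key=lambda r: score(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

random.seed(0)
pair = rs_preference_pair(
    "Summarize the report in two sentences and cite one figure.",
    sample_response=lambda p: f"draft #{random.randint(0, 999)}",
)
print(pair)
```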
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
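The multimodal RLHF methods referenced in the entry above condition alignment on who is giving the feedback; one common variational formulation encodes a user's few labeled comparisons into a latent vector and conditions the reward model on it. The module below is a schematic of that idea with made-up layer sizes, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentUserReward(nn.Module):
    """Schematic reward model conditioned on a latent user embedding."""

    def __init__(self, feat_dim=64, latent_dim=8):
        super().__init__()
        # Encoder: maps a user's annotated comparisons to a latent preference vector.
        self.encoder = nn.Sequential(nn.Linear(feat_dim * 2, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        # Reward head: scores a response feature vector given that latent vector.
        self.reward = nn.Sequential(nn.Linear(feat_dim + latent_dim, 32), nn.ReLU(),
                                    nn.Linear(32, 1))

    def forward(self, user_comparisons, response_feats):
        # user_comparisons: (k, 2 * feat_dim) chosen/rejected feature pairs from one user.
        z = self.encoder(user_comparisons).mean(dim=0)   # (latent_dim,)
        z = z.expand(response_feats.size(0), -1)         # broadcast over candidate responses
        return self.reward(torch.cat([response_feats, z], dim=-1)).squeeze(-1)

model = LatentUserReward()
comparisons = torch.randn(3, 128)  # three labeled pairs from one annotator
responses = torch.randn(5, 64)     # five candidate responses to score
print(model(comparisons, responses).shape)  # torch.Size([5])
```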
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- LIRE: listwise reward enhancement for preference alignment [27.50204023448716]
We propose a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework.
LIRE is straightforward to implement, requiring minimal parameter tuning, and seamlessly aligns with the pairwise paradigm.
Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks.
arXiv Detail & Related papers (2024-05-22T10:21:50Z)
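LIRE, in the entry above, folds offline rewards for multiple responses into a listwise objective. A common way to render such a "listwise reward" loss, sketched below, is to weight each response's log-likelihood by a softmax over its reward so that higher-reward responses are pushed up hardest; treat this as one plausible reading, not LIRE's exact formula.

```python
import torch
import torch.nn.functional as F

def listwise_reward_loss(logp, rewards, temperature=1.0):
    """Listwise objective: reward-softmax-weighted negative log-likelihood.

    logp:    (batch, k) summed log-probabilities of k candidate responses
             per prompt under the policy.
    rewards: (batch, k) offline rewards for the same candidates.
    """
    weights = F.softmax(rewards / temperature, dim=-1)  # target distribution over candidates
    return -(weights * logp).sum(dim=-1).mean()

torch.manual_seed(0)
logp = torch.randn(2, 4)     # 2 prompts, 4 sampled responses each
rewards = torch.randn(2, 4)
print(listwise_reward_loss(logp, rewards))
```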
- Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning [68.45370492516531]
We introduce Scalarized Multi-Objective Reinforcement Learning (SMORL) for the Recommender Systems (RS) setting.
The SMORL agent augments standard recommendation models with additional RL layers that drive it to simultaneously satisfy three principal objectives: accuracy, diversity, and novelty of recommendations.
Our experimental results on two real-world datasets reveal a substantial increase in aggregate diversity, a moderate increase in accuracy, and reduced repetitiveness of recommendations, and they demonstrate the importance of reinforcing diversity and novelty as complementary objectives.
arXiv Detail & Related papers (2021-10-28T13:22:45Z)
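Scalarized multi-objective RL, as in the SMORL entry above, collapses the three objective signals into a single reward before standard RL is applied. The weights below are made up for illustration; the paper's reward shaping and additional RL layers are not shown.

```python
def scalarized_reward(accuracy, diversity, novelty, weights=(1.0, 0.5, 0.5)):
    """Linear scalarization of the three recommendation objectives."""
    w_acc, w_div, w_nov = weights
    return w_acc * accuracy + w_div * diversity + w_nov * novelty

# A relevant but unsurprising item vs. a slightly less relevant, fresher one.
print(scalarized_reward(accuracy=1.0, diversity=0.2, novelty=0.1))  # 1.15
print(scalarized_reward(accuracy=0.7, diversity=0.9, novelty=0.8))  # 1.55
```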
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
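The post-processing described in the entry above adjusts an already-trained scoring function so that ranking outcomes are balanced across protected groups. A minimal model-agnostic adjustment, shown below, is a per-group additive offset that equalizes mean scores; the paper's framework optimizes a fairness/utility trade-off rather than this naive centering.

```python
from collections import defaultdict

def group_offsets(scores, groups):
    """Per-group additive offsets that equalize mean scores across groups."""
    by_group = defaultdict(list)
    for s, g in zip(scores, groups):
        by_group[g].append(s)
    overall = sum(scores) / len(scores)
    return {g: overall - sum(v) / len(v) for g, v in by_group.items()}

def adjust(scores, groups):
    """Model-agnostic post-hoc adjustment: shift each score by its group offset."""
    offsets = group_offsets(scores, groups)
    return [s + offsets[g] for s, g in zip(scores, groups)]

scores = [0.9, 0.8, 0.4, 0.3]
groups = ["A", "A", "B", "B"]
print(adjust(scores, groups))  # group means are now equal
```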
This list is automatically generated from the titles and abstracts of the papers in this site.