VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2409.18417v2
- Date: Thu, 12 Dec 2024 06:18:36 GMT
- Title: VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback
- Authors: Guoxi Zhang, Jiuding Duan
- Abstract summary: This paper addresses the cost-efficiency aspect of Reinforcement Learning from Human Feedback (RLHF). RLHF leverages datasets of human preferences over outputs of large language models (LLMs). We show that introducing an auction mechanism can play an essential role in enhancing the cost-efficiency of RLHF.
- Score: 2.07180164747172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the cost-efficiency aspect of Reinforcement Learning from Human Feedback (RLHF). RLHF leverages datasets of human preferences over outputs of large language models (LLMs) to instill human expectations into LLMs. Although preference annotation comes with a monetized cost, the economic utility of a preference dataset has not been considered thus far. What exacerbates this situation is that, given the complex intransitive or cyclic relationships in preference datasets, existing algorithms for fine-tuning LLMs are still far from capturing comprehensive preferences. This raises severe cost-efficiency concerns in production environments, where preference data accumulate over time. In this paper, we discuss the fine-tuning of LLMs as a monetized economy and introduce an auction mechanism to improve the efficiency of preference data collection in dollar terms. We show that introducing an auction mechanism can play an essential role in enhancing the cost-efficiency of RLHF while maintaining satisfactory model performance. Experimental results demonstrate that our proposed auction-based protocol is cost-effective for fine-tuning LLMs by concentrating on high-quality feedback.
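The abstract does not spell out the auction protocol, but the title points to a Vickrey (second-price sealed-bid) mechanism, whose defining property is that truthful bidding is a dominant strategy. The sketch below is an illustrative assumption rather than the paper's actual implementation: it runs a reverse Vickrey auction in which annotators bid the dollar cost they would charge to label one preference pair, the cheapest bidder wins, and the winner is paid the second-lowest bid. The function and annotator names are hypothetical.

```python
# Illustrative sketch only: a generic second-price sealed-bid (Vickrey) reverse
# auction for allocating one preference-annotation task. This is NOT the
# paper's protocol, which the abstract does not specify; names are hypothetical.
from typing import Dict, Tuple


def vickrey_reverse_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Pick the annotator with the lowest cost bid and pay them the
    second-lowest bid (the Vickrey rule, under which bidding one's true
    cost is a dominant strategy)."""
    if len(bids) < 2:
        raise ValueError("A Vickrey auction needs at least two bids.")
    ranked = sorted(bids.items(), key=lambda kv: kv[1])  # cheapest first
    winner, _ = ranked[0]
    payment = ranked[1][1]  # winner is paid the runner-up's bid
    return winner, payment


if __name__ == "__main__":
    # Three hypothetical annotators quote their cost (in dollars) for
    # labeling one preference pair.
    bids = {"annotator_a": 0.42, "annotator_b": 0.35, "annotator_c": 0.50}
    winner, payment = vickrey_reverse_auction(bids)
    print(f"{winner} labels the pair and is paid ${payment:.2f}")
```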
Related papers
- Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing [14.114970711442512]
This paper introduces Attention Pruning, a fairness-aware simulated annealing approach to prune attention heads in large language models (LLMs).
Our experiments show that Attention Pruning achieves up to a 40% reduction in gender bias and outperforms state-of-the-art bias mitigation strategies.
arXiv Detail & Related papers (2025-03-20T03:02:32Z) - ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Parity LLM Data Valuation [11.36712576361739]
Large Language Models (LLMs) heavily rely on high-quality training data, making data valuation crucial for optimizing model performance.
We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples.
We propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation.
arXiv Detail & Related papers (2025-03-02T22:51:12Z) - Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations.
In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%.
DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z) - EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO).
EACO aligns MLLMs with self-generated preference data, using only 5k images at low cost.
EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z) - Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different types of preference data on model performance.
It aims to reduce the dependency of LLMs on extensive amounts of preference data, which are expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis [18.44272589315175]
We show how to balance the trade-off between the higher quality but more expensive human data and the lower quality yet substantially cheaper LLM-generated data.
Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining both human and LLM-generated data.
arXiv Detail & Related papers (2024-10-09T05:15:13Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Prototypical Reward Network for Data-Efficient RLHF [17.220998116937444]
A reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs).
Our proposed framework Proto-RM leverages prototypical networks to enhance reward models under limited human feedback.
arXiv Detail & Related papers (2024-06-06T15:23:30Z) - ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large language models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z) - Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [54.682106515794864]
Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets.
This paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers that uses pre-trained Language Models (LMs) for offline RL.
Empirical results indicate that LaMo achieves state-of-the-art performance in sparse-reward tasks.
arXiv Detail & Related papers (2023-10-31T16:24:17Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that they include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)