Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning
- URL: http://arxiv.org/abs/2411.02481v3
- Date: Fri, 31 Jan 2025 21:15:53 GMT
- Title: Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning
- Authors: Guangxuan Xu, Kai Xu, Shivchander Sudalairaj, Hao Wang, Akash Srivastava
- Abstract summary: We introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation. Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW.
- Score: 15.776175440446414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference tuning relies on high-quality human preference data, which is often expensive and time-consuming to gather. In this paper, we introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation by leveraging off-the-shelf LLMs for preference data annotation. Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We evaluate Dr.SoW across 221 different LLM pairs and empirically find a strong correlation between the performance gap of the paired models and the quality of the reward signal. This insight provides a practical guideline for selecting LLMs for data annotation. Additionally, we introduce an end-to-end pipeline that customizes reward functions based on user query domains. Without fine-tuning, it improves accuracy on domain-specific evaluations. With a pair of Mistral-7B models, Dr.SoW achieves a RewardBench score of 82.6, outperforming the best trained reward functions from the same model class and demonstrating competitive performance against SoTA models in the Safety (91.0) and Reasoning (88.0) domains. Further, we preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW. Our approach pushes Llama-3-8B to achieve a 37.4% (+15.1%) win rate on ArenaHard and a 40.7% (+17.8%) win rate on length-controlled AlpacaEval 2.0.
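The reward signal described in the abstract can be illustrated with a minimal sketch. It assumes that per-token log-probabilities of a response under each model are already available; the function names (`sequence_logprob`, `density_ratio_reward`, `annotate_pair`) and the numeric values are hypothetical, chosen for illustration, and the sketch does not reproduce the paper's domain-specific pipeline or model selection.

```python
from typing import Sequence, Tuple


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities to obtain log p(y | x) for one response."""
    return sum(token_logprobs)


def density_ratio_reward(
    strong_token_logprobs: Sequence[float],
    weak_token_logprobs: Sequence[float],
) -> float:
    """Log-density ratio: log p_strong(y | x) - log p_weak(y | x).

    'strong' stands for the better-aligned model and 'weak' for the
    less-aligned one (assumed to score the same response y for prompt x).
    """
    return sequence_logprob(strong_token_logprobs) - sequence_logprob(weak_token_logprobs)


def annotate_pair(reward_a: float, reward_b: float) -> Tuple[str, str]:
    """Return (chosen, rejected) labels; the higher-reward response is chosen."""
    return ("a", "b") if reward_a >= reward_b else ("b", "a")


# Made-up per-token log-probabilities for two candidate responses.
reward_a = density_ratio_reward([-0.25, -0.25], [-1.0, -1.0])
reward_b = density_ratio_reward([-1.5, -1.0], [-0.5, -0.5])
chosen, rejected = annotate_pair(reward_a, reward_b)
print(reward_a, reward_b, chosen, rejected)
```

In this toy case the better-aligned model assigns response `a` a higher likelihood than the less-aligned model does, so `a` receives a positive reward and is labeled as the chosen response; pairs labeled this way could then feed a preference-tuning method such as DPO.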
Related papers
- R.I.P.: Better Models by Survival of the Fittest Prompts [51.2293437372642]
We introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high variance and low quality responses.
This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair.
arXiv Detail & Related papers (2025-01-30T18:50:25Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We show that our approach consistently boosts DPO by a considerable margin.
Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date.
We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models.
Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z)
- Generative Reward Models [42.30530024761532]
Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs).
Recent work has shown that synthetic preference labels may not align well with human preference judgments.
We propose a hybrid approach that unifies RLHF and RLAIF methodologies.
Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.
arXiv Detail & Related papers (2024-10-02T17:58:39Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of these artifacts.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge [15.980606104936365]
Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications.
Existing frameworks like Alpaca-Eval 2.0 LC and Arena-Hard v0.1 are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts.
We introduce a novel data pipeline that curates domain-specific evaluation sets tailored for LLM-as
arXiv Detail & Related papers (2024-08-16T15:41:43Z)
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment [57.03947082589616]
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets.
We study this and find that preference data gives a better learning signal when the underlying responses are contrastive.
We introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs.
Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%.
arXiv Detail & Related papers (2024-08-12T16:24:51Z)
- Closing the gap between open-source and commercial large language models for medical evidence summarization [20.60798771155072]
Large language models (LLMs) hold great promise in summarizing medical evidence.
Most recent studies focus on the application of proprietary LLMs.
While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones.
arXiv Detail & Related papers (2024-07-25T05:03:01Z)
- Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation [20.41379322900742]
We introduce FLAMe, a family of Foundational Large Autorater Models.
FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks.
We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning.
arXiv Detail & Related papers (2024-07-15T15:33:45Z)
- Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
Advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z)
- RAIN: Your Language Models Can Align Themselves without Finetuning [25.703729145091483]
Large language models (LLMs) often demonstrate inconsistencies with human preferences.
We show that unaligned LLMs can directly produce responses consistent with human preferences via self-boosting.
We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation.
arXiv Detail & Related papers (2023-09-13T17:59:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.