R.I.P.: Better Models by Survival of the Fittest Prompts
- URL: http://arxiv.org/abs/2501.18578v1
- Date: Thu, 30 Jan 2025 18:50:25 GMT
- Title: R.I.P.: Better Models by Survival of the Fittest Prompts
- Authors: Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
- Abstract summary: We introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high-variance, low-quality responses.
This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected responses in a preference pair.
- Score: 51.2293437372642
- Abstract: Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high-variance, low-quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected responses in a preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets or to build high-quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves the AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, moving the model from 18th to 6th place overall on the leaderboard.
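As a rough illustration of the idea in the abstract, the sketch below filters preference pairs by the two signals RIP measures: the rejected response's reward and the chosen-rejected reward gap. The helper names, thresholds, filtering direction, and the `reward_fn` interface are assumptions for this sketch, not the paper's published recipe.

```python
# Hypothetical sketch of RIP-style prompt filtering; thresholds and rule are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def rip_filter(
    pairs: List[PreferencePair],
    reward_fn: Callable[[str, str], float],  # reward_fn(prompt, response) -> scalar reward
    min_rejected_reward: float,              # keep prompts whose rejected response is still reasonably good
    max_reward_gap: float,                   # keep prompts whose chosen-rejected gap (a response-variance proxy) is small
) -> List[PreferencePair]:
    """Keep prompts that do not look low-quality under the abstract's two signals."""
    kept = []
    for pair in pairs:
        r_chosen = reward_fn(pair.prompt, pair.chosen)
        r_rejected = reward_fn(pair.prompt, pair.rejected)
        if r_rejected >= min_rejected_reward and (r_chosen - r_rejected) <= max_reward_gap:
            kept.append(pair)
    return kept
```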
Related papers
- Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de facto approach for aligning language models with human preferences.
Recent work has shown that DPO's effectiveness relies on training data quality.
We discover that the reference model's probability space naturally detects high-quality training samples.
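One plausible way to operationalize that statement is to rank preference pairs by how strongly the frozen reference model already separates the chosen from the rejected response and keep the clearest ones; the margin criterion, the `logp_ref` interface, and `keep_fraction` below are illustrative assumptions rather than the paper's actual selection rule.

```python
# Illustrative sketch: select preference pairs by the reference model's log-probability margin.
# The margin criterion and keep_fraction are assumptions, not the paper's exact method.
import math
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def select_clear_pairs(
    pairs: List[Pair],
    logp_ref: Callable[[str, str], float],  # log p_ref(response | prompt) under the frozen reference model
    keep_fraction: float = 0.5,
) -> List[Pair]:
    def margin(pair: Pair) -> float:
        prompt, chosen, rejected = pair
        return logp_ref(prompt, chosen) - logp_ref(prompt, rejected)

    ranked = sorted(pairs, key=margin, reverse=True)
    return ranked[: max(1, math.ceil(keep_fraction * len(ranked)))]
```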
arXiv Detail & Related papers (2025-01-25T07:21:50Z)
- Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning [15.776175440446414]
We introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation.
Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal.
We preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW.
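A minimal sketch of the log-density-ratio reward described above, assuming two Hugging Face causal LMs that share a tokenizer (for example an instruct model and its base model); the token accounting here (tokenizing prompt and response jointly, summing response-token log-probabilities) is a simplification, not the paper's exact scoring code.

```python
# Sketch of a Dr.SoW-style reward: log p_strong(response | prompt) - log p_weak(response | prompt).
# Assumes `strong` and `weak` are Hugging Face causal LMs sharing one tokenizer.
import torch


def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities the model assigns to the response tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)           # predictions for tokens 1..T-1
    token_logps = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1 :].sum().item()            # keep only the response portion


def density_ratio_reward(strong, weak, tokenizer, prompt: str, response: str) -> float:
    """Log-density ratio of the better-aligned model over the less-aligned one."""
    return sequence_logprob(strong, tokenizer, prompt, response) - sequence_logprob(
        weak, tokenizer, prompt, response
    )
```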
arXiv Detail & Related papers (2024-11-04T18:54:39Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
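One plausible instantiation of conditioning preference pairs on quality scores is sketched below: each original pair is relabeled under two target scores, so the model also sees what a lower-quality response looks like when that is the conditioning target. The `<reward=...>` tag format and the flipped-pair construction are assumptions for illustration, not the paper's exact relabeling scheme.

```python
# Illustrative relabeling that conditions preference pairs on quality scores.
# The "<reward=...>" tag and the flipped-pair construction are assumptions for this sketch.
from typing import Dict, List


def reward_augment(pairs: List[Dict]) -> List[Dict]:
    """Each input dict: {"prompt", "chosen", "rejected", "chosen_score", "rejected_score"}."""
    augmented = []
    for p in pairs:
        # Conditioned on the higher score, the original chosen response remains preferred.
        augmented.append({
            "prompt": f"<reward={p['chosen_score']:.1f}> {p['prompt']}",
            "chosen": p["chosen"],
            "rejected": p["rejected"],
        })
        # Conditioned on the lower score, the originally rejected response better matches the target.
        augmented.append({
            "prompt": f"<reward={p['rejected_score']:.1f}> {p['prompt']}",
            "chosen": p["rejected"],
            "rejected": p["chosen"],
        })
    return augmented
```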
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO).
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild [57.272096543738336]
We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs).
WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs.
We have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs.
arXiv Detail & Related papers (2024-06-07T09:15:44Z)
- SimPO: Simple Preference Optimization with a Reference-Free Reward [43.136307294076545]
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm.
We propose SimPO, a simpler yet more effective alternative to DPO.
SimPO consistently and significantly outperforms DPO without substantially increasing response length.
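SimPO's objective is reference-free: it scores each response by its length-normalized average log-probability under the policy and asks the chosen response to beat the rejected one by a target margin. A compact sketch of that loss follows; the default values of beta and gamma are placeholders, not the paper's tuned settings.

```python
# SimPO-style reference-free preference loss (length-normalized log-probs, target margin gamma).
import torch
import torch.nn.functional as F


def simpo_loss(
    chosen_logps: torch.Tensor,      # summed log-probs of chosen response tokens, shape [batch]
    rejected_logps: torch.Tensor,    # summed log-probs of rejected response tokens, shape [batch]
    chosen_lengths: torch.Tensor,    # number of chosen response tokens, shape [batch]
    rejected_lengths: torch.Tensor,  # number of rejected response tokens, shape [batch]
    beta: float = 2.0,               # reward scale (placeholder value)
    gamma: float = 0.5,              # target reward margin (placeholder value)
) -> torch.Tensor:
    chosen_reward = beta * chosen_logps / chosen_lengths
    rejected_reward = beta * rejected_logps / rejected_lengths
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```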
arXiv Detail & Related papers (2024-05-23T16:01:46Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, reaches a quality only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.