HelpSteer2: Open-source dataset for training top-performing reward models
- URL: http://arxiv.org/abs/2406.08673v1
- Date: Wed, 12 Jun 2024 22:28:08 GMT
- Title: HelpSteer2: Open-source dataset for training top-performing reward models
- Authors: Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
- Abstract summary: We develop HelpSteer2, a permissively licensed preference dataset.
HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets.
We propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models.
- Score: 9.214886217647157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models. HelpSteer2 is available at https://huggingface.co/datasets/nvidia/HelpSteer2 and code is available at https://github.com/NVIDIA/NeMo-Aligner
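The abstract's key mechanism is a reward model that predicts several attribute scores per response rather than a single preference. A minimal sketch, assuming the field names on the nvidia/HelpSteer2 dataset card (prompt, response, and five 0-4 attribute ratings) and purely illustrative weights (not the paper's SteerLM 2.0 recipe), shows how those ratings can be collapsed into one scalar training signal:

```python
# Minimal sketch: load HelpSteer2 and collapse its per-response attribute
# ratings into a single scalar reward. Field names follow the dataset card
# for nvidia/HelpSteer2; the weights are illustrative assumptions only.
from datasets import load_dataset

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
WEIGHTS = {"helpfulness": 1.0, "correctness": 0.5, "coherence": 0.25,
           "complexity": 0.0, "verbosity": -0.25}  # illustrative, not from the paper

def scalar_reward(example):
    """Weighted sum of the 0-4 attribute ratings for one (prompt, response) row."""
    example["reward"] = sum(WEIGHTS[a] * example[a] for a in ATTRIBUTES)
    return example

ds = load_dataset("nvidia/HelpSteer2", split="train")
ds = ds.map(scalar_reward)
print(ds[0]["response"][:80], ds[0]["reward"])
```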
Related papers
- Fantastic LLMs for Preference Data Annotation and How to (not) Find Them [15.776175440446414]
Preference tuning of large language models (LLMs) relies on high-quality human preference data.
Existing methods can use trained reward models or proprietary models as judges for preference annotation.
We introduce customized density ratio (CDR) that leverages open-source LLMs for data annotation.
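As a rough illustration of density-ratio annotation, the sketch below scores each candidate by the log-probability gap between a stronger and a weaker open model and marks the higher-scoring response as chosen. The `sum_logprob` helper is a hypothetical stand-in for whatever scoring call is available; this is not the paper's CDR implementation.

```python
# Illustrative density-ratio preference annotation: prefer the response that
# the strong model assigns relatively more probability than the weak model.
# sum_logprob(model, prompt, response) is a hypothetical helper returning the
# summed token log-probability of `response` given `prompt`.
def density_ratio(prompt, response, strong, weak, sum_logprob):
    return sum_logprob(strong, prompt, response) - sum_logprob(weak, prompt, response)

def annotate_pair(prompt, resp_a, resp_b, strong, weak, sum_logprob):
    score_a = density_ratio(prompt, resp_a, strong, weak, sum_logprob)
    score_b = density_ratio(prompt, resp_b, strong, weak, sum_logprob)
    return (resp_a, resp_b) if score_a >= score_b else (resp_b, resp_a)  # (chosen, rejected)

# Toy usage with fake log-probabilities standing in for real model calls.
fake_logprob = lambda model, prompt, response: (-0.05 if model == "strong" else -0.2) * len(response)
print(annotate_pair("Q?", "a long, careful answer", "ok", "strong", "weak", fake_logprob))
```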
arXiv Detail & Related papers (2024-11-04T18:54:39Z)
- Reward Modeling with Weak Supervision for Language Models [12.599789817157188]
This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance.
By analyzing RLHF datasets to identify imprecise responses, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data.
Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets.
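A minimal sketch of that workflow, with hand-written heuristics voting on which response is preferred and a confidence-weighted vote standing in for the label model; the heuristics and weights here are illustrative assumptions, not the paper's labeling functions.

```python
# Illustrative weak supervision for preference data: each labeling function
# votes +1 (prefer `chosen`), -1 (prefer `rejected`), or 0 (abstain), and a
# simple confidence-weighted vote aggregates the votes into a weak label.
def lf_longer_response(chosen, rejected):
    diff = len(chosen) - len(rejected)
    return 1 if diff > 50 else (-1 if diff < -50 else 0)  # abstain when close

def lf_fewer_refusals(chosen, rejected):
    refusal = "I cannot help with that"
    c, r = refusal in chosen, refusal in rejected
    return -1 if c and not r else (1 if r and not c else 0)

LABELING_FUNCTIONS = [(lf_longer_response, 0.6), (lf_fewer_refusals, 0.9)]

def weak_label(chosen, rejected):
    """Aggregate labeling-function votes into a weak preference in {-1, 0, +1}."""
    score = sum(weight * lf(chosen, rejected) for lf, weight in LABELING_FUNCTIONS)
    return 1 if score > 0 else (-1 if score < 0 else 0)

print(weak_label("A detailed, step-by-step explanation. " * 5, "Short reply."))  # -> 1
```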
arXiv Detail & Related papers (2024-10-28T09:37:58Z)
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs [54.11217789754743]
We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets.
We curated the Skywork-Reward data collection, which contains only 80K preference pairs.
We developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard.
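The sketch below illustrates one common flavor of such filtering, assuming an existing quality scorer: keep pairs where the scorer separates chosen from rejected by a clear margin, and drop pairs that a large length gap alone could explain. The thresholds, field names, and stand-in scorer are illustrative, not Skywork's recipe.

```python
# Illustrative preference-pair filtering: require a clear score margin and
# reject pairs whose preference might just reflect a big length difference.
def keep_pair(pair, score_fn, margin=2.0, max_len_ratio=3.0):
    chosen, rejected = pair["chosen"], pair["rejected"]
    len_ratio = max(len(chosen), 1) / max(len(rejected), 1)
    if len_ratio > max_len_ratio or len_ratio < 1.0 / max_len_ratio:
        return False  # preference may be a length-bias artifact
    return score_fn(chosen) - score_fn(rejected) >= margin

pairs = [
    {"chosen": "The integral evaluates to pi/4; here is the substitution that shows it.",
     "rejected": "It is probably around 1, I think."},
    {"chosen": "A very long padded answer. " * 20, "rejected": "No."},
]
word_count = lambda text: len(text.split())  # stand-in for a real quality scorer
filtered = [p for p in pairs if keep_pair(p, score_fn=word_count)]
print(len(filtered))  # -> 1: the length-gap pair is dropped
```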
arXiv Detail & Related papers (2024-10-24T06:06:26Z)
- 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data [0.0]
This paper presents a compute-efficient approach to pre-training a language model, "1.5-Pints", in only 9 days.
Based on MT-Bench (a benchmark that emulates human judgments), 1.5-Pints outperforms Apple's OpenELM and Microsoft's Phi.
This is achieved by a carefully curated pre-training dataset of 57 billion tokens, using a mix of automated and manual human review.
arXiv Detail & Related papers (2024-08-07T02:14:52Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
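For reference, a minimal sketch of the DPO loss applied to such step-level pairs, taking summed token log-probabilities under the current policy and a frozen reference model; this is the standard DPO objective, not the paper's full MCTS training loop.

```python
# Standard DPO loss on (chosen, rejected) pairs, given summed log-probs of
# each continuation under the policy being trained and a frozen reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())  # ~0.598
```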
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
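For context, the sketch below shows the standard pairwise (Bradley-Terry) reward-model loss that such contrastive refinements build on: the scalar score of the chosen response is pushed above that of the rejected one. This is the baseline objective, not the paper's specific contrastive formulation.

```python
# Baseline pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected),
# averaged over the batch; an optional margin makes the separation stricter.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores, margin=0.0):
    return -F.logsigmoid(chosen_scores - rejected_scores - margin).mean()

# Toy usage with made-up reward-head outputs.
loss = pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
print(loss.item())
```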
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
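A minimal sketch of the self-play data construction behind this idea: the ground-truth SFT response is treated as chosen and the current model's own generation as rejected, and the resulting pairs feed a DPO-style update each iteration. The `generate` callable is a hypothetical stand-in for any decoding routine.

```python
# Illustrative SPIN-style pair construction: the model plays against itself by
# serving as the "rejected" side of each preference pair.
def build_spin_pairs(sft_examples, generate):
    pairs = []
    for ex in sft_examples:
        self_response = generate(ex["prompt"])  # current model's own attempt
        pairs.append({"prompt": ex["prompt"],
                      "chosen": ex["response"],      # human / SFT target
                      "rejected": self_response})    # self-generated opponent
    return pairs

# Toy usage with a stub generator in place of a real decoding call.
demo = [{"prompt": "Explain RLHF in one sentence.",
         "response": "RLHF fine-tunes a model against a learned reward from human preferences."}]
print(build_spin_pairs(demo, generate=lambda prompt: "A rough first-draft answer."))
```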
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM [9.766582733709726]
HelpSteer is a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful.
Training Llama 2 70B on the HelpSteer dataset with the SteerLM technique produces a model that scores 7.54 on MT Bench.
arXiv Detail & Related papers (2023-11-16T03:13:29Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)