HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
- URL: http://arxiv.org/abs/2311.09528v1
- Date: Thu, 16 Nov 2023 03:13:29 GMT
- Title: HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
- Authors: Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan
Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant,
Aidan Swope, Oleksii Kuchaiev
- Abstract summary: Training Llama 2 70B using the HelpSteer dataset with the SteerLM technique produces a model that scores 7.54 on MT Bench.
HelpSteer is a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful.
- Score: 9.766582733709726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing open-source helpfulness preference datasets do not specify what
makes some responses more helpful and others less so. Models trained on these
datasets can incidentally learn to model dataset artifacts (e.g. preferring
longer but unhelpful responses only due to their length). To alleviate this
problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated
for the various aspects that make responses helpful. Specifically, our
37k-sample dataset has annotations for correctness, coherence, complexity, and
verbosity in addition to overall helpfulness of responses. Training Llama 2 70B
using the HelpSteer dataset with the SteerLM technique produces a model that scores
7.54 on MT Bench, which is currently the highest score for open models that do
not require training data from more powerful models (e.g. GPT4). We release
this dataset under a CC-BY-4.0 license at
https://huggingface.co/datasets/nvidia/HelpSteer
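
As a quick way to explore the released data, the snippet below loads the dataset from the Hugging Face Hub and prints the per-response attribute scores named in the abstract. The column names are assumed from the attributes listed above; the dataset card at the URL documents the exact schema.

```python
# Minimal sketch: load HelpSteer and inspect its per-response attribute scores.
# Column names follow the attributes listed in the abstract; check the dataset
# card at https://huggingface.co/datasets/nvidia/HelpSteer for the exact schema.
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer", split="train")

example = ds[0]
print(example["prompt"][:200])
print(example["response"][:200])
for attr in ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]:
    print(f"{attr}: {example[attr]}")
```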
Related papers
- GenQA: Generating Millions of Instructions from a Handful of Prompts [67.54980063851605]
Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models.
In this work, we study methods for generating large instruction datasets from a single prompt.
Our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks as well as conversational evaluations.
arXiv Detail & Related papers (2024-06-14T17:44:08Z)
- HelpSteer2: Open-source dataset for training top-performing reward models [9.214886217647157]
We develop HelpSteer2, a permissively licensed preference dataset.
HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets.
We propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models.
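
A rough illustration of the attribute-conditioned idea behind SteerLM-style alignment: target attribute scores (0-4, as in HelpSteer) are serialized into a steering string attached to the prompt so generation can be steered at inference time. The template below is a hypothetical stand-in, not the exact format used by SteerLM or SteerLM 2.0.

```python
# Hypothetical helper: serialize target attribute scores into a steering string
# appended to the prompt. The exact template SteerLM uses during training is
# not reproduced here; this is an illustrative stand-in.
def build_steering_prompt(prompt: str, attributes: dict[str, int]) -> str:
    steer = ",".join(f"{name}:{value}" for name, value in attributes.items())
    return f"{prompt}\n[steering] {steer}"

print(build_steering_prompt(
    "Explain how transformers use attention.",
    {"helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 1},
))
```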
arXiv Detail & Related papers (2024-06-12T22:28:08Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial process for generating in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the number of seed samples available for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
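
A minimal sketch of the retrieval step this approach implies: score candidate examples from an auxiliary dataset by similarity to the small seed set and keep the closest ones. TF-IDF is used here only as a stand-in retriever; the paper's actual retrieval model and filtering are not reproduced.

```python
# Minimal sketch of retrieval-augmented augmentation: rank examples from an
# auxiliary dataset by similarity to the small seed set and keep the top-k.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

seed_texts = ["How do I reset a forgotten password?", "Steps to enable two-factor auth."]
auxiliary_texts = ["Guide to account recovery.", "Best pizza recipes.", "Enabling 2FA on mobile."]

vectorizer = TfidfVectorizer().fit(seed_texts + auxiliary_texts)
seed_vecs = vectorizer.transform(seed_texts)
aux_vecs = vectorizer.transform(auxiliary_texts)

# Score each auxiliary example by its best match against the seed set.
scores = cosine_similarity(aux_vecs, seed_vecs).max(axis=1)
top_k = scores.argsort()[::-1][:2]
augmented = [auxiliary_texts[i] for i in top_k]
print(augmented)
```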
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation [92.2167864437497]
We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data.
Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available.
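
One way to picture the metadata-driven step: turn a dataset's name, description, and field names into a prompt that asks an LLM to propose an instruction-tuning task. The prompt wording below is a hypothetical illustration, not Dynosaur's actual pipeline.

```python
# Hypothetical sketch: build an LLM prompt from dataset metadata that asks for
# a candidate instruction-tuning task, in the spirit of Dynosaur.
def metadata_to_instruction_prompt(name: str, description: str, fields: list[str]) -> str:
    field_list = ", ".join(fields)
    return (
        f"Dataset: {name}\n"
        f"Description: {description}\n"
        f"Fields: {field_list}\n"
        "Propose an instruction-tuning task: state which fields form the input, "
        "which field is the output, and write the instruction."
    )

print(metadata_to_instruction_prompt(
    "samsum", "Dialogues paired with human-written summaries.", ["dialogue", "summary"]
))
```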
arXiv Detail & Related papers (2023-05-23T17:56:26Z)
- Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance.
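
A compact sketch of the parameter-averaging idea: given several single-dataset expert checkpoints that share one architecture, average the corresponding tensors in their state dicts and load the result into a fresh model. This shows the general technique, not the paper's exact recipe.

```python
# Average corresponding parameters across expert checkpoints that share
# one architecture. Illustrative sketch only.
import torch

def average_experts(expert_state_dicts: list[dict]) -> dict:
    averaged = {}
    for key in expert_state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in expert_state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged

# Usage (assumed paths): load expert checkpoints trained on different QA
# datasets, average them, then load the result into the shared architecture.
# merged = average_experts([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(merged)
```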
arXiv Detail & Related papers (2021-09-28T17:08:22Z)
- VANiLLa : Verbalized Answers in Natural Language at Large Scale [2.9098477555578333]
This dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets.
The answer sentences in this dataset are syntactically and semantically closer to the question than to the triple fact.
arXiv Detail & Related papers (2021-05-24T16:57:54Z)
- Rapidly Bootstrapping a Question Answering Dataset for COVID-19 [88.86456834766288]
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19.
This is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available.
arXiv Detail & Related papers (2020-04-23T17:35:11Z)
- Have you forgotten? A method to assess if machine learning models have forgotten data [20.9131206112401]
In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity.
In this paper, we want to address the challenging question of whether data have been forgotten by a model.
We establish statistical methods that compare the target's outputs with outputs of models trained with different datasets.
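
An illustrative version of such a comparison: collect per-sample outputs (for example, losses or confidence scores on the queried data) from the target model and from reference models trained with and without that data, then test which reference distribution the target's outputs resemble. The two-sample KS test below is a stand-in statistic on synthetic numbers, not the paper's actual procedure.

```python
# Compare a target model's per-sample scores against reference models trained
# with and without the queried data; the KS test is a stand-in statistic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
target_scores = rng.normal(0.2, 0.05, size=500)        # target model on queried data
with_data_scores = rng.normal(0.2, 0.05, size=500)      # reference trained WITH the data
without_data_scores = rng.normal(0.5, 0.10, size=500)   # reference trained WITHOUT the data

# If the target's scores resemble the "with data" reference more closely,
# the queried data was likely not forgotten.
print(ks_2samp(target_scores, with_data_scores).pvalue)
print(ks_2samp(target_scores, without_data_scores).pvalue)
```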
arXiv Detail & Related papers (2020-04-21T16:13:45Z)
- What do Models Learn from Question Answering Datasets? [2.28438857884398]
We investigate if models are learning reading comprehension from question answering datasets.
We evaluate models on their generalizability to out-of-domain examples, responses to missing or incorrect data, and ability to handle question variations.
We make recommendations for building future QA datasets that better evaluate the task of question answering through reading comprehension.
arXiv Detail & Related papers (2020-04-07T15:41:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.