HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
- URL: http://arxiv.org/abs/2505.11475v1
- Date: Fri, 16 May 2025 17:31:19 GMT
- Title: HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
- Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, Oleksii Kuchaiev,
- Abstract summary: We introduce HelpSteer3-Preference, a high-quality, human-annotated preference dataset comprising of over 40,000 samples.<n>These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios.<n>Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%)
- Score: 43.78167339173775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising of over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate HelpSteer3-Preference can also be applied to train Generative RMs and how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference
Related papers
- Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples [58.55904048776596]
Most open-source multilingual reward models are primarily trained on preference datasets in high-resource languages.<n>We propose RELIC, a novel in-context learning framework for reward modeling in low-resource Indic languages.
arXiv Detail & Related papers (2025-06-19T17:56:16Z) - Anyprefer: An Agentic Framework for Preference Data Synthesis [62.3856754548222]
We propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model.<n> external tools are introduced to assist the judge model in accurately rewarding the target model's responses.<n>The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs.
arXiv Detail & Related papers (2025-04-27T15:21:59Z) - Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences.<n>Recent work has shown DPO's effectiveness relies on training data quality.<n>We discover that reference model probability space naturally detects high-quality training samples.
arXiv Detail & Related papers (2025-01-25T07:21:50Z) - Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [65.01625761120924]
We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier)<n>We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection.<n>Experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data.
arXiv Detail & Related papers (2024-12-09T08:36:10Z) - HelpSteer2: Open-source dataset for training top-performing reward models [9.214886217647157]
We develop HelpSteer2, a permissively licensed preference dataset.
HelpSteer2 consists of only ten thousand response pairs, an order of fewer than existing preference datasets.
We propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models.
arXiv Detail & Related papers (2024-06-12T22:28:08Z) - FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models [48.484485609995986]
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM)
There are currently no realistic datasets and benchmarks for FedLLM.
We propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics.
arXiv Detail & Related papers (2024-06-07T11:19:30Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - Everyone Deserves A Reward: Learning Customized Human Preferences [25.28261194665836]
Reward models (RMs) are essential for aligning large language models with human preferences to improve interaction quality.
We propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set.
We find several ways to better preserve the general preferring ability while training the customized RMs.
arXiv Detail & Related papers (2023-09-06T16:03:59Z) - Improving Low-resource Reading Comprehension via Cross-lingual
Transposition Rethinking [0.9236074230806579]
Extractive Reading (ERC) has made tremendous advances enabled by the availability of large-scale high-quality ERC training data.
Despite of such rapid progress and widespread application, the datasets in languages other than high-resource languages such as English remain scarce.
We propose a Cross-Lingual Transposition ReThinking (XLTT) model by modelling existing high-quality extractive reading comprehension datasets in a multilingual environment.
arXiv Detail & Related papers (2021-07-11T09:35:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.