HelpSteer2-Preference: Complementing Ratings with Preferences
- URL: http://arxiv.org/abs/2410.01257v1
- Date: Wed, 2 Oct 2024 06:05:52 GMT
- Title: HelpSteer2-Preference: Complementing Ratings with Preferences
- Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
- Abstract summary: Reward models are critical for aligning models to follow instructions.
There is a lack of evidence that either approach is better than the other when adequately matched for data.
We propose a novel approach to combine Bradley-Terry and Regression reward modeling.
- Score: 45.01567242039055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at https://huggingface.co/datasets/nvidia/HelpSteer2 and openly release the trained Reward Model at https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward
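The two training paradigms contrasted in the abstract can be illustrated with minimal scalar loss functions. This is a simplified sketch, not the paper's implementation: Bradley-Terry training maximizes the likelihood that the chosen response outscores the rejected one, while Regression training fits the reward to a scalar human rating (as in the original HelpSteer2 annotations). Function names here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(r_chosen, r_rejected):
    # Negative log-likelihood of the chosen response beating the rejected
    # one under the Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r).
    return -math.log(sigmoid(r_chosen - r_rejected))

def regression_loss(r_pred, human_rating):
    # Squared error against a scalar human rating (Regression-style training).
    return (r_pred - human_rating) ** 2
```

For example, a reward model that scores the chosen response 2.0 and the rejected response 0.0 incurs a much smaller Bradley-Terry loss than one with the scores reversed, which is what drives the preference ordering during training.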
Related papers
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z) - General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL)
It streamlines the integration of preferences into an arbitrary class of pre-trained models.
We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Differentially Private Reward Estimation with Preference Feedback [15.943664678210146]
Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests.
An adversarial attack in any step of the above pipeline might reveal private and sensitive information of human labelers.
We focus on the problem of reward estimation from preference-based feedback while protecting the privacy of each individual labeler.
arXiv Detail & Related papers (2023-10-30T16:58:30Z) - Zephyr: Direct Distillation of LM Alignment [59.03530095974505]
We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy.
We apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
arXiv Detail & Related papers (2023-10-25T19:25:16Z) - Backward Compatibility During Data Updates by Weight Interpolation [17.502410289568587]
We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI)
BCWI reduces negative flips without sacrificing the improved accuracy of the new model.
We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
arXiv Detail & Related papers (2023-01-25T12:23:10Z) - Variational Bayesian Unlearning [54.26984662139516]
We study the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased.
We show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief.
In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging.
arXiv Detail & Related papers (2020-10-24T11:53:00Z)
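The core step of the backward-compatible weight interpolation (BCWI) entry above can be sketched as a simple element-wise blend of old and new model parameters. This is a minimal illustration with hypothetical names, assuming parameters are stored as plain floats; the paper additionally explores importance weighting and averaging multiple new models.

```python
def interpolate_weights(old_params, new_params, alpha=0.5):
    # Blend old and new model parameters element-wise: alpha = 0 keeps the
    # old model, alpha = 1 keeps the new one. Intermediate values aim to
    # reduce negative flips while retaining most of the new model's gains.
    return {name: (1.0 - alpha) * old_params[name] + alpha * new_params[name]
            for name in new_params}
```

For instance, interpolating `{"w": 0.0}` and `{"w": 2.0}` with `alpha=0.5` yields `{"w": 1.0}`, halfway between the two checkpoints.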
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.