Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
- URL: http://arxiv.org/abs/2504.09895v3
- Date: Mon, 13 Oct 2025 05:00:17 GMT
- Title: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
- Authors: Shuai Zhao, Yunqiu Xu, Linchao Zhu, Yi Yang,
- Abstract summary: RefAlign is a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models.<n>We introduce textitRefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models.
- Score: 56.44985139322512
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models~(LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models. The code is available at https://github.com/mzhaoshuai/RefAlign.
Related papers
- SimGR: Escaping the Pitfalls of Generative Decoding in LLM-based Recommendation [68.00727783181289]
A core objective in recommender systems is to accurately model the distribution of user preferences over items to enable personalized recommendations.<n>We observe that existing methods inevitably introduce systematic bias when estimating item-level preference distributions.<n>We propose textbfSimply textbfGenerative textbfRecommendation (textbfSimGR), a framework that directly models item-level preference distributions in a shared latent space.
arXiv Detail & Related papers (2026-02-08T07:26:52Z) - Doubly Robust Alignment for Large Language Models [11.384429991118926]
This paper studies reinforcement learning from human feedback for aligning large language models with human preferences.<n>We propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified.
arXiv Detail & Related papers (2025-06-01T21:34:37Z) - Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs) for extracting diverse human preferences from binary comparisons.<n>DRMs represent preferences as vectors and analyze them using Principal Component Analysis (PCA)<n>DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z) - MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $mathbf120k$ fine-grained, human-annotated preference comparison pairs.<n>We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.<n>Our approach is rigorously evaluated across $mathbf10$ distinct dimensions and $mathbf27$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications.<n> Ensuring their alignment with the diverse preferences of individual users has become a critical challenge.<n>We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - LRHP: Learning Representations for Human Preferences via Preference Pairs [45.056558199304554]
We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences.
We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
arXiv Detail & Related papers (2024-10-06T14:48:28Z) - Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approxing NDCG with a differentiable surrogate loss.
OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z) - Beyond Scalar Reward Model: Learning Generative Judge from Preference Data [26.219896368149236]
Learning from preference feedback is a common practice for aligning large language models(LLMs) with human value.
Scalar models lack interpretability and are known to be susceptible to biases in datasets.
This paper investigates leveraging the generation capability of LLMs to address both limitations in one shot.
arXiv Detail & Related papers (2024-10-01T07:38:58Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies.
We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values.
In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback.
We propose two frameworks to address heterogeneous human feedback in principled ways: personalization-based one and aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z) - Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.<n>Joint preferences over instruction and response pairs can significantly enhance the alignment of large language models.
arXiv Detail & Related papers (2024-03-31T02:05:40Z) - Transforming and Combining Rewards for Aligning Large Language Models [69.44634017612798]
A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model.
We use a log-sigmoid function to transform rewards learned from Bradley-Terry preference models.
Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
arXiv Detail & Related papers (2024-02-01T16:39:28Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.