GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
- URL: http://arxiv.org/abs/2410.08193v3
- Date: Mon, 10 Feb 2025 22:20:07 GMT
- Title: GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
- Authors: Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh,
- Abstract summary: Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences.
Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining.
We introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model.
- Score: 36.52424795446663
- License:
- Abstract: Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model--a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.
Related papers
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z) - Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO amplifying the reward signal learned during alignment training.
arXiv Detail & Related papers (2024-04-25T17:39:50Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - RAIN: Your Language Models Can Align Themselves without Finetuning [25.703729145091483]
Large language models (LLMs) often demonstrate inconsistencies with human preferences.
We show that unaligned LLMs can directly produce responses consistent with human preferences via self-boosting.
We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation.
arXiv Detail & Related papers (2023-09-13T17:59:09Z) - Effective and Efficient Training for Sequential Recommendation using
Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that the models enhanced with our method can achieve performances exceeding or very close to stateof-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.