RM-Distiller: Exploiting Generative LLM for Reward Model Distillation
- URL: http://arxiv.org/abs/2601.14032v1
- Date: Tue, 20 Jan 2026 14:53:32 GMT
- Title: RM-Distiller: Exploiting Generative LLM for Reward Model Distillation
- Authors: Hongli Zhou, Hui Huang, Wei Liu, Chenglong Wang, Xingyuan Bu, Lvyuan Han, Fuhai Song, Muyun Yang, Wenhao Jiang, Hailong Cao, Tiejun Zhao
- Abstract summary: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit their rich knowledge and capabilities for RM distillation. We propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs.
- Score: 47.016779894794304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit their rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher's generative distribution to regularize the RM and preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and on reinforcement learning-based alignment, showing that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic study of RM distillation from generative LLMs.
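The scoring and generation components above suggest a concrete loss shape: a Bradley-Terry ranking term whose margin comes from the teacher's scores, plus a KL penalty toward the teacher's token distribution. Below is a minimal PyTorch sketch of such an objective; the function name, tensor names, margin parameterization, and beta weight are assumptions for illustration, not the paper's released code.

```python
import torch.nn.functional as F

def margin_kl_loss(r_chosen, r_rejected, teacher_margin,
                   student_logits, teacher_logits, beta=0.1):
    """Sketch of a margin-aware RM loss with a generative KL regularizer.

    r_chosen, r_rejected: student RM scalar rewards per pair, shape (B,)
    teacher_margin:       teacher's score gap per pair, shape (B,)
    student_logits, teacher_logits: next-token logits, shape (B, T, V)
    """
    # Bradley-Terry term with a teacher-derived margin: the reward gap
    # must exceed the teacher's preference strength, not merely zero.
    ranking = -F.logsigmoid(r_chosen - r_rejected - teacher_margin).mean()

    # KL term pulling the student's generative head toward the teacher,
    # intended to preserve linguistic knowledge during reward training.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return ranking + beta * kl
```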
Related papers
- Reinforcement-aware Knowledge Distillation for LLM Reasoning [63.53679456364683]
Reinforcement learning (RL) post-training has recently driven gains in large language models (LLMs) for long chain-of-thought reasoning. Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. We propose RL-aware distillation (RLAD), which performs selective imitation during RL, guiding the student toward the teacher only when it improves the current policy update.
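At its simplest, selective imitation is a gate on the distillation term. The toy sketch below assumes the gate is a per-sample positive-advantage check; that criterion and all names are hypothetical stand-ins, not RLAD's published rule.

```python
import torch.nn.functional as F

def gated_imitation_loss(student_logits, teacher_logits, advantage, alpha=0.5):
    """Toy selective-imitation loss: apply the teacher KL only on samples
    where the gate fires (here, a positive-advantage proxy)."""
    # Per-token KL from student to teacher, shape (B, T).
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="none").sum(-1)
    # Per-sample gate: imitate only where imitation looks helpful.
    gate = (advantage > 0).float().unsqueeze(-1)  # (B, 1), broadcasts over T
    return alpha * (gate * kl).mean()
```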
arXiv Detail & Related papers (2026-02-26T00:20:39Z) - RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation [31.28415780479141]
Reinforcement Learning from Teacher-Model Refinement (RLfR) is a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines.
arXiv Detail & Related papers (2025-07-29T20:35:35Z) - Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment [35.80989342492335]
Noisy preferences in human feedback can lead to reward misgeneralization. This paper aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. We propose an online Collaborative Reward Modeling framework to achieve robust preference learning.
arXiv Detail & Related papers (2025-05-15T10:58:20Z) - RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits [73.26238057915583]
We introduce LASeR, which frames reward model selection as a multi-armed bandit problem. We show that LASeR boosts iterative training, improving the absolute average accuracy of Llama-3-8B over three datasets. We also show that LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline.
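Framing RM selection as a bandit means choosing, at each iteration, the reward model whose past selections produced the best downstream signal. A self-contained UCB1 sketch follows; LASeR's actual bandit algorithm and reward definition may differ, so treat this as a generic stand-in.

```python
import math

class RMBandit:
    """UCB1 selector over a pool of reward models (generic stand-in)."""

    def __init__(self, n_rms, c=1.0):
        self.counts = [0] * n_rms    # times each RM was chosen
        self.values = [0.0] * n_rms  # running mean of downstream reward
        self.c = c                   # exploration strength

    def select(self):
        # Try every RM once before trusting the confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        return max(range(len(self.counts)),
                   key=lambda i: self.values[i]
                   + self.c * math.sqrt(math.log(total) / self.counts[i]))

    def update(self, arm, reward):
        # Fold the observed downstream metric into the running mean.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```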
arXiv Detail & Related papers (2024-10-02T16:46:38Z) - Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model [12.6937643116018]
Large Language Models (LLMs) have been effectively utilized as recommenders, achieving impressive performance.
However, the high inference latency of LLMs significantly restricts their practical deployment.
This work investigates knowledge distillation from cumbersome LLM-based recommendation models to lightweight sequential models.
arXiv Detail & Related papers (2024-05-01T06:23:54Z) - Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners [102.20090188997301]
We explore how to obtain a model that combines the strengths of Contrastive Learning (CL) and Masked Image Modeling (MIM).
To obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy.
Experimental results show that Hybrid Distill achieves superior performance on different benchmarks.
arXiv Detail & Related papers (2023-06-28T02:19:35Z) - From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Model to Pre-trained Machine Reader [130.45769668885487]
Pre-trained Machine Reader (PMR) is a novel method for retrofitting masked language models (MLMs) to pre-trained machine reading comprehension (MRC) models without acquiring labeled data.
To build the proposed PMR, we constructed a large volume of general-purpose and high-quality MRC-style training data.
PMR has the potential to serve as a unified model for tackling various extraction and classification tasks in the MRC formulation.
arXiv Detail & Related papers (2022-12-09T10:21:56Z)