SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
- URL: http://arxiv.org/abs/2511.07896v1
- Date: Wed, 12 Nov 2025 01:27:00 GMT
- Title: SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
- Authors: Dengcan Liu, Jiahao Li, Zheren Fu, Yi Tu, Jiajun Li, Zhendong Mao, Yongdong Zhang
- Abstract summary: Reward models (RMs) serve as proxies for human preference evaluation and guide model alignment. We propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations. SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters.
- Score: 54.31950189922548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs the SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict the final preference score. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
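As a rough illustration of the pipeline the abstract describes, the sketch below projects a hidden representation onto preselected SAE decoder directions to get alignment scores, then aggregates them with a tiny linear reward head trained with a Bradley-Terry-style pairwise loss. All shapes, the feature-selection step, and the loss are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_selected = 64, 512, 32    # illustrative sizes

# A pretrained SAE's decoder rows act as interpretable directions in
# the LLM's representation space (random stand-ins here).
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Assume preference-relevant directions were already identified,
# e.g. by probing on a small labelled set (hypothetical selection).
pref_idx = rng.choice(n_features, size=n_selected, replace=False)
D = W_dec[pref_idx]                               # (n_selected, d_model)

def alignment_scores(h):
    """Project a hidden representation onto the preference directions."""
    return D @ h                                  # (n_selected,)

# Lightweight reward head: a single linear layer over the scores.
w_head = rng.standard_normal(n_selected) * 0.01

def reward(h):
    return float(w_head @ alignment_scores(h))

# Bradley-Terry-style pairwise loss for training the head (assumed).
h_chosen = rng.standard_normal(d_model)
h_rejected = rng.standard_normal(d_model)
margin = reward(h_chosen) - reward(h_rejected)
loss = -np.log(1.0 / (1.0 + np.exp(-margin)))
print(f"pairwise loss: {loss:.4f}")
```

Note that only `w_head` (here 32 parameters) would be trained, which is how a sub-1% trainable-parameter budget becomes plausible.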
Related papers
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
- Toward Preference-aligned Large Language Models via Residual-based Model Steering [9.241565393225953]
We introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), which exploits preference signals encoded in the residual streams of Large Language Models. We evaluate PaLRS on various small-to-medium-scale open-source LLMs.
arXiv Detail & Related papers (2025-09-28T17:16:16Z)
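A minimal sketch of the residual-steering idea described above: derive a preference direction from residual-stream activations of preferred versus dispreferred completions, then add it back at inference. The mean-difference recipe and the strength knob `alpha` are assumptions; PaLRS's actual extraction procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothetical residual-stream activations at one layer, collected for
# preferred and dispreferred completions.
h_pref = rng.standard_normal((100, d_model))
h_dispref = rng.standard_normal((100, d_model)) - 0.3

# Preference direction as the mean activation difference (assumed recipe).
steer = h_pref.mean(axis=0) - h_dispref.mean(axis=0)
steer /= np.linalg.norm(steer)

def steered_hidden(h, alpha=2.0):
    """Nudge the residual stream toward the preferred region at inference;
    alpha is a hypothetical strength knob."""
    return h + alpha * steer
```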
- Interpretable Reward Model via Sparse Autoencoder [16.903840987027912]
We introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder into a reward model. SARM maps the hidden activations of an LLM-based RM into an interpretable, sparse, and monosemantic feature space. Empirical evaluations demonstrate that SARM enables direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models.
arXiv Detail & Related papers (2025-08-12T08:41:00Z)
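The sketch below illustrates the kind of feature-level reward attribution SARM's summary describes: encode a hidden state with a sparse (ReLU) SAE encoder and read off each feature's contribution to the reward. The encoder, bias, and reward weights are random stand-ins, not SARM's trained components.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_features = 64, 256

# Hypothetical SAE encoder; ReLU plus a negative bias yields a sparse,
# nonnegative feature code.
W_enc = rng.standard_normal((n_features, d_model)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(n_features)
w_reward = rng.standard_normal(n_features) * 0.01

def sae_features(h):
    return np.maximum(W_enc @ h + b_enc, 0.0)

def reward_with_attribution(h):
    f = sae_features(h)
    contrib = w_reward * f                      # per-feature contribution
    top = np.argsort(-np.abs(contrib))[:5]      # most influential features
    return float(contrib.sum()), list(zip(top.tolist(), contrib[top]))

r, top_features = reward_with_attribution(rng.standard_normal(d_model))
print(r, top_features)
```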
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
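One plausible reading of "self-paced" is a curriculum that trains on prompts the model currently solves only sometimes. The helper below is a hypothetical sketch of such a selection rule; the band thresholds and pass-rate estimates are illustrative, not SPaRFT's actual criterion.

```python
import numpy as np

def self_paced_batch(prompts, success_rates, low=0.2, high=0.8, k=8):
    """Keep prompts whose estimated pass rate sits in a 'learnable' band;
    thresholds and pass-rate estimates are hypothetical."""
    rates = np.asarray(success_rates)
    idx = np.flatnonzero((rates >= low) & (rates <= high))[:k]
    return [prompts[i] for i in idx]

batch = self_paced_batch(["p1", "p2", "p3"], [0.05, 0.5, 0.95], k=2)
print(batch)  # ['p2']: too-easy and too-hard prompts are filtered out
```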
- Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [65.6966065843227]
Iterative Reweight-then-Optimize (IRO) is a framework that performs RL-style alignment of a frozen base model without touching its parameters. At test time, the value functions are used to guide the base model's generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT).
arXiv Detail & Related papers (2025-06-21T21:49:02Z)
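The test-time search described above can be caricatured as value-guided best-of-N over a frozen sampler, as in this sketch; `base_sample` and `value_fn` are hypothetical callables, and the real method is iterative rather than single-shot.

```python
import numpy as np

def guided_generate(base_sample, value_fn, prompt, n=16):
    """Value-guided best-of-N over a frozen sampler: the base model's
    parameters are never touched; only candidate selection is learned.
    base_sample and value_fn are hypothetical callables."""
    candidates = [base_sample(prompt) for _ in range(n)]
    scores = [value_fn(prompt, c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```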
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs) for extracting diverse human preferences from binary comparisons. DRMs represent preferences as vectors and analyze them using Principal Component Analysis (PCA). DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
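A compact sketch of the PCA step the summary describes: take (chosen - rejected) embedding differences and read preference directions off the top principal axes. The embedding source and the number of components are assumptions.

```python
import numpy as np

def decompose_preferences(emb_chosen, emb_rejected, k=3):
    """PCA over (chosen - rejected) embedding differences; the top
    principal axes act as candidate preference directions."""
    diffs = emb_chosen - emb_rejected             # (n_pairs, d)
    diffs = diffs - diffs.mean(axis=0)            # centre before SVD
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                 # (k, d) directions

rng = np.random.default_rng(3)
dirs = decompose_preferences(rng.standard_normal((200, 64)),
                             rng.standard_normal((200, 64)))
print(dirs.shape)  # (3, 64)
```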
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce HyPER, a Hybrid Preference routER that defers an annotation to either humans or language models (LMs). We show that the hybrid mixture of synthetic and direct human preferences selected by HyPER achieves 7-13% better RM performance on RewardBench than using either source exclusively. We also analyze features from HyPER and find that prompts with moderate safety concerns or complexity benefit the most from human feedback.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
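A toy sketch of the routing decision described above: send an instance to human annotators only when a predicted gain from human feedback clears a threshold. `gain_model`, the threshold, and the budget check are all hypothetical stand-ins for HyPER's learned router.

```python
def route_annotation(example, gain_model, budget_left, threshold=0.5):
    """Defer to human annotators only when the predicted quality gain of a
    human label over an LM label clears a threshold and budget remains."""
    return "human" if budget_left > 0 and gain_model(example) > threshold else "lm"

print(route_annotation({"prompt": "..."}, lambda ex: 0.7, budget_left=10))  # human
```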
- Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences. Our key idea is leveraging the human prior knowledge within a small amount of (seed) data. We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or misaligned with user expectations. Current approaches focus on reinforcement learning with human feedback to improve model alignment. We propose a method to enhance LLM alignment through fine-grained, token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
- Compositional preference models for aligning LMs [15.036426712762147]
Compositional Preference Models (CPMs) are a framework that decomposes one global preference assessment into several interpretable features. CPMs make it possible to control which properties of the preference data are used to train the preference model, and to build it from features believed to underlie human preference judgments.
arXiv Detail & Related papers (2023-10-17T01:31:59Z)
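The sketch below shows the compositional aggregation CPMs are built on: per-feature preference judgments combined by a weighted sum. The feature names and weight values are invented for illustration.

```python
import numpy as np

def cpm_score(feature_scores, weights):
    """Aggregate per-feature preference judgments with learned weights;
    feature names and weight values here are invented."""
    keys = sorted(feature_scores)
    x = np.array([feature_scores[k] for k in keys])
    w = np.array([weights[k] for k in keys])
    return float(w @ x)

score = cpm_score({"helpful": 0.8, "harmless": 0.9, "concise": 0.4},
                  {"helpful": 1.0, "harmless": 1.5, "concise": 0.3})
print(f"{score:.2f}")  # 2.27
```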
- Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
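ASTxplainer's core alignment step, mapping token predictions to AST nodes, can be approximated in spirit with Python's own `ast` module, which exposes the source span of each node. The sketch below is a rough analogue of that alignment, not the paper's method.

```python
import ast

def node_spans(source):
    """Map AST node types to their source spans, a rough analogue of
    aligning model token predictions with syntactic constructs."""
    spans = []
    for node in ast.walk(ast.parse(source)):
        # Only concrete nodes carry source positions (Python 3.8+).
        if hasattr(node, "lineno") and hasattr(node, "end_col_offset"):
            spans.append((type(node).__name__,
                          (node.lineno, node.col_offset),
                          (node.end_lineno, node.end_col_offset)))
    return spans

for name, start, end in node_spans("def f(x):\n    return x + 1\n")[:5]:
    print(name, start, end)
```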
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.