Interpretable Reward Modeling with Active Concept Bottlenecks
- URL: http://arxiv.org/abs/2507.04695v2
- Date: Sun, 20 Jul 2025 05:53:25 GMT
- Title: Interpretable Reward Modeling with Active Concept Bottlenecks
- Authors: Sonia Laguna, Katarzyna Kobalczyk, Julia E. Vogt, Mihaela van der Schaar
- Abstract summary: We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. We formalize an active learning strategy that dynamically acquires the most informative concept labels.
- Score: 54.00085739303773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
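To make the setup concrete, here is a minimal sketch of a concept-bottleneck reward head paired with an uncertainty-based proxy for the Expected Information Gain acquisition score. The module names, shapes, Bradley-Terry preference loss, and the entropy proxy are assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch: concept bottleneck reward model + uncertainty-driven concept-label acquisition.
# Shapes, names, and the entropy proxy for Expected Information Gain are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptBottleneckRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.encoder = encoder                                    # (prompt, response) features -> hidden_dim
        self.concept_head = nn.Linear(hidden_dim, num_concepts)   # per-concept logits (the bottleneck)
        self.reward_head = nn.Linear(num_concepts, 1)             # reward depends only on concepts

    def forward(self, x: torch.Tensor):
        concept_probs = torch.sigmoid(self.concept_head(self.encoder(x)))
        reward = self.reward_head(concept_probs).squeeze(-1)
        return reward, concept_probs


def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry style pairwise loss on (chosen, rejected) responses.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def acquisition_scores(concept_probs: torch.Tensor) -> torch.Tensor:
    # Bernoulli entropy per (example, concept) as a cheap stand-in for Expected
    # Information Gain: request annotations where the model is most uncertain.
    p = concept_probs.clamp(1e-6, 1 - 1e-6)
    return -(p * p.log() + (1 - p) * (1 - p).log())


if __name__ == "__main__":
    encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
    model = ConceptBottleneckRewardModel(encoder, hidden_dim=64, num_concepts=8)
    x_chosen, x_rejected = torch.randn(4, 128), torch.randn(4, 128)
    r_c, c_probs = model(x_chosen)
    r_r, _ = model(x_rejected)
    loss = preference_loss(r_c, r_r)              # train on pairwise preferences
    next_queries = acquisition_scores(c_probs)    # annotate the highest-scoring (example, concept) pairs
```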
Related papers
- Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z) - Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization [5.822390655999343]
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions to a set of human-understandable concepts. CBMs typically assume that datasets contain accurate concept labels, an assumption that is often violated in practice; mislabeled concepts can significantly degrade performance. We introduce the Concept Preference Optimization (CPO) objective, which effectively mitigates the negative impact of concept mislabeling on CBM performance.
arXiv Detail & Related papers (2025-04-25T02:43:10Z) - V-CEM: Bridging Performance and Intervenability in Concept-based Models [6.617167508694296]
Concept-based Explainable AI (C-XAI) is a rapidly growing research field that enhances AI model interpretability by leveraging intermediate, human-understandable concepts. CBMs explicitly predict concepts before making final decisions, enabling interventions to correct misclassified concepts. With intervention, CBMs remain effective in Out-Of-Distribution (OOD) settings, but they struggle to match the performance of black-box models. We propose the Variational Concept Embedding Model (V-CEM), which leverages variational inference to improve intervention responsiveness in CEMs.
arXiv Detail & Related papers (2025-04-04T22:43:04Z) - Language Guided Concept Bottleneck Models for Interpretable Continual Learning [62.09201360376577]
Continual learning aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance, leaving interpretability largely unaddressed. We introduce a novel framework that integrates language-guided Concept Bottleneck Models to address both challenges.
arXiv Detail & Related papers (2025-03-30T02:41:55Z) - LLM Pretraining with Continuous Concepts [71.98047075145249]
Next token prediction has been the standard training objective used in large language model pretraining. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts.
arXiv Detail & Related papers (2025-02-12T16:00:11Z) - Stochastic Concept Bottleneck Models [8.391254800873599]
Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on human-understandable concepts.
We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that explicitly models concept dependencies.
A single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness.
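As a toy illustration of why modeling dependencies helps interventions (an assumption-laden example, not the paper's method): with a joint Gaussian over concept logits, fixing one logit shifts the conditional mean of every correlated concept.

```python
import numpy as np

# Two correlated concept logits with a joint Gaussian prior (toy numbers, not from the paper).
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Intervention: an annotator pins concept 0's logit to 2.0.
observed = 2.0

# Gaussian conditioning propagates the intervention to the correlated concept.
cond_mean_1 = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (observed - mu[0])
print(cond_mean_1)  # 1.6 -- the dependent concept moves too, unlike in an independence-assuming CBM
```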
arXiv Detail & Related papers (2024-06-27T15:38:37Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
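A minimal sketch of the evaluation this implies (the data format and `score` function are placeholders, not RewardBench's actual API): score each trio and report the fraction where the chosen response outranks the rejected one.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(score: Callable[[str, str], float],
                      trios: Iterable[Tuple[str, str, str]]) -> float:
    # Fraction of prompt-chosen-rejected trios where the reward model prefers the chosen response.
    correct, total = 0, 0
    for prompt, chosen, rejected in trios:
        correct += score(prompt, chosen) > score(prompt, rejected)
        total += 1
    return correct / max(total, 1)
```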
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose InfoRM, a reward modeling framework built on a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
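A hedged sketch of a variational information-bottleneck reward head of the kind the summary describes; the Gaussian prior, layer sizes, and KL weight are assumptions rather than InfoRM's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IBRewardHead(nn.Module):
    # Compress hidden features into a stochastic latent z and predict reward from z only.
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.reward = nn.Linear(latent_dim, 1)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()                  # reparameterized sample
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()   # KL(q(z|h) || N(0, I))
        return self.reward(z).squeeze(-1), kl


def ib_preference_loss(r_chosen, r_rejected, kl, beta=1e-3):
    # Pairwise preference loss plus a KL penalty that discards reward-irrelevant information.
    return -F.logsigmoid(r_chosen - r_rejected).mean() + beta * kl
```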
arXiv Detail & Related papers (2024-02-14T17:49:07Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to seen prompts, and improves generalization to unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
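A small sketch of the general idea (token count, prior, and parameterization are assumptions, not the paper's code): treat the learnable prompt tokens as a Gaussian variational posterior and regularize with a KL term.

```python
import torch
import torch.nn as nn


class BayesianPrompt(nn.Module):
    # Learnable context tokens modeled as a Gaussian posterior rather than point estimates.
    def __init__(self, num_tokens: int = 4, dim: int = 512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_tokens, dim))
        self.log_sigma = nn.Parameter(torch.full((num_tokens, dim), -2.0))

    def forward(self):
        prompt = self.mu + torch.randn_like(self.mu) * self.log_sigma.exp()   # reparameterized sample
        kl = 0.5 * (self.mu.pow(2) + (2 * self.log_sigma).exp()
                    - 1.0 - 2 * self.log_sigma).sum()                         # KL to N(0, I)
        return prompt, kl   # prepend `prompt` to the text-encoder token embeddings at train time
```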
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - AcME -- Accelerated Model-agnostic Explanations: Fast Whitening of the Machine-Learning Black Box [1.7534486934148554]
Interpretability approaches should provide actionable insights without making users wait.
We propose Accelerated Model-agnostic Explanations (AcME), an interpretability approach that quickly provides feature importance scores both at the global and the local level.
AcME not only computes feature rankings, but also provides a what-if analysis tool to assess how changes in feature values would affect model predictions.
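A simplified, model-agnostic sketch in the spirit of that description (the quantile grid and median baseline are assumptions, not AcME's exact procedure): vary one feature at a time across its empirical quantiles and rank features by the spread of the resulting predictions.

```python
import numpy as np


def quantile_importance(predict, X: np.ndarray,
                        quantiles=(0.05, 0.25, 0.5, 0.75, 0.95)):
    # For each feature: hold the others at their medians, sweep the feature over its
    # quantiles, and use the range of predictions as a global importance / what-if score.
    base = np.median(X, axis=0)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        preds = []
        for q in np.quantile(X[:, j], quantiles):
            x = base.copy()
            x[j] = q
            preds.append(float(predict(x[None, :])[0]))
        scores[j] = np.ptp(preds)
    return np.argsort(scores)[::-1], scores   # ranking (most important first) and raw scores
```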
arXiv Detail & Related papers (2021-12-23T15:18:13Z) - Guided Variational Autoencoder for Disentanglement Learning [79.02010588207416]
We propose guided variational autoencoder (Guided-VAE), an algorithm that learns a controllable generative model through latent representation disentanglement.
We design an unsupervised strategy and a supervised strategy in Guided-VAE and observe enhanced modeling and controlling capability over the vanilla VAE.
arXiv Detail & Related papers (2020-04-02T20:49:15Z)