MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference
- URL: http://arxiv.org/abs/2602.15206v1
- Date: Mon, 16 Feb 2026 21:36:28 GMT
- Title: MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference
- Authors: Raphaël Baur, Yannick Metz, Maria Gkoulta, Mennatallah El-Assady, Giorgia Ramponi, Thomas Kleine Buening
- Abstract summary: Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders. We show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations.
- Score: 22.19400649559091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. Currently, it remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, and stops that provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.
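To make the recipe the abstract describes concrete, here is a minimal PyTorch sketch: a shared encoder amortizes a Gaussian posterior over a latent reward variable, feedback-specific likelihood decoders score the data (here a Bradley-Terry likelihood for comparisons and a Gaussian likelihood for ratings, two of the four feedback types mentioned), and everything is trained by maximizing a single ELBO. All module names, architectures, and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, NOT the authors' code: names, architectures, and
# the toy data below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, LATENT_DIM = 4, 8

class RewardEncoder(nn.Module):
    """Shared amortized encoder: feedback data -> diagonal Gaussian q(z)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU())
        self.mu = nn.Linear(32, LATENT_DIM)
        self.log_std = nn.Linear(32, LATENT_DIM)

    def forward(self, states):
        h = self.net(states).mean(dim=0)  # crude set pooling over all feedback
        return self.mu(h), self.log_std(h).exp()

class RewardDecoder(nn.Module):
    """Latent reward function: maps (z, state) to a scalar reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, z, states):
        z = z.expand(states.shape[0], -1)
        return self.net(torch.cat([z, states], dim=-1)).squeeze(-1)

def comparison_log_lik(r_a, r_b, prefs):
    """Bradley-Terry likelihood for comparisons: P(a > b) = sigmoid(r_a - r_b)."""
    return -F.binary_cross_entropy_with_logits(r_a - r_b, prefs, reduction="sum")

def rating_log_lik(r, ratings):
    """Gaussian likelihood for scalar ratings (unit variance assumed)."""
    return -0.5 * ((r - ratings) ** 2).sum()

encoder, reward = RewardEncoder(), RewardDecoder()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(reward.parameters()), lr=1e-3)

# Toy heterogeneous feedback: 16 preference pairs and 16 rated states.
s_a, s_b = torch.randn(16, STATE_DIM), torch.randn(16, STATE_DIM)
prefs = torch.randint(0, 2, (16,)).float()
s_rated, ratings = torch.randn(16, STATE_DIM), torch.randn(16)
all_states = torch.cat([s_a, s_b, s_rated])

for step in range(200):
    mu, std = encoder(all_states)
    q = torch.distributions.Normal(mu, std)
    z = q.rsample()  # reparameterized sample of the latent reward
    # Single ELBO: sum of per-feedback-type log-likelihoods minus KL to prior.
    log_lik = (comparison_log_lik(reward(z, s_a), reward(z, s_b), prefs)
               + rating_log_lik(reward(z, s_rated), ratings))
    kl = torch.distributions.kl_divergence(
        q, torch.distributions.Normal(0.0, 1.0)).sum()
    loss = kl - log_lik  # negative ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this formulation, each additional feedback type (e.g., demonstrations or stops) would contribute one more log-likelihood term to the same objective, which is why no manual loss balancing is needed.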
Related papers
- Reinforcement Learning from Multi-level and Episodic Human Feedback [1.9686770963118378]
We propose an algorithm to efficiently learn both the reward function and the optimal policy from multi-level human feedback. We show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
arXiv Detail & Related papers (2025-04-20T20:09:19Z) - Reward Learning from Multiple Feedback Types [7.910064218813772]
We show that diverse types of feedback can be utilized and lead to strong reward modeling performance. This work is the first strong indicator of the potential of multi-type feedback for RLHF.
arXiv Detail & Related papers (2025-02-28T13:29:54Z) - Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models [6.472081755630166]
We show how linear aggregation of rewards exhibits some vulnerabilities. We propose a transformation of reward functions inspired by the economic theory of utility functions. We show that models trained with Inada-transformations score as more helpful while being less harmful.
arXiv Detail & Related papers (2025-01-08T19:03:17Z) - Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback.
We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function (the standard form of potential-based shaping is recalled after this list).
arXiv Detail & Related papers (2024-10-22T19:52:08Z) - Learning Recommender Systems with Soft Target: A Decoupled Perspective [49.83787742587449]
We propose a novel decoupled soft label optimization framework to consider the objectives as two aspects by leveraging soft labels.
We present a soft-label generation algorithm that uses label propagation to explore users' latent interests in unobserved feedback via neighbors.
arXiv Detail & Related papers (2024-10-09T04:20:15Z) - Regularized Contrastive Partial Multi-view Outlier Detection [76.77036536484114]
We propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD).
In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency.
Experimental results on four benchmark datasets demonstrate that our proposed approach could outperform state-of-the-art competitors.
arXiv Detail & Related papers (2024-08-02T14:34:27Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that the superiority of fine-grained feedback over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - Simulating Bandit Learning from User Feedback for Extractive Question Answering [51.97943858898579]
We study learning from user feedback for extractive question answering by simulating feedback using supervised data.
We show that systems initially trained on a small number of examples can dramatically improve given feedback from users on model-predicted answers.
arXiv Detail & Related papers (2022-03-18T17:47:58Z) - Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views.
We show that our approach provides consistent improvements over the state-of-the-art image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z) - Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
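As background for the potential-based shaping mentioned in the Navigating Noisy Feedback entry above (standard material from Ng et al., 1999, not that paper's specific contribution), a shaping term built from any potential function leaves the optimal policy unchanged:

$$\tilde r(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s),$$

where $\Phi : S \to \mathbb{R}$ is an arbitrary potential over states and $\gamma$ is the discount factor. Feedback, however noisy, can therefore be injected through $\Phi$ without changing which policies are optimal.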