Which Rewards Matter? Reward Selection for Reinforcement Learning under Limited Feedback
- URL: http://arxiv.org/abs/2510.00144v1
- Date: Tue, 30 Sep 2025 18:17:49 GMT
- Title: Which Rewards Matter? Reward Selection for Reinforcement Learning under Limited Feedback
- Authors: Shreyas Chaudhari, Renhao Zhang, Philip S. Thomas, Bruno Castro da Silva
- Abstract summary: We study the problem of reward selection for reinforcement learning from limited feedback. We find that the critical subsets of rewards are those that guide the agent along optimal trajectories, and that effective selection methods yield near-optimal policies with significantly fewer reward labels than full supervision.
- Score: 16.699326038073856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability of reinforcement learning algorithms to learn effective policies is determined by the rewards available during training. However, for practical problems, obtaining large quantities of reward labels is often infeasible due to computational or financial constraints, particularly when relying on human feedback. When reinforcement learning must proceed with limited feedback -- only a fraction of samples get rewards labeled -- a fundamental question arises: which samples should be labeled to maximize policy performance? We formalize this problem of reward selection for reinforcement learning from limited feedback (RLLF), introducing a new problem formulation that facilitates the study of strategies for selecting impactful rewards. Two types of selection strategies are investigated: (i) heuristics that rely on reward-free information such as state visitation and partial value functions, and (ii) strategies pre-trained using auxiliary evaluative feedback. We find that critical subsets of rewards are those that (1) guide the agent along optimal trajectories, and (2) support recovery toward near-optimal behavior after deviations. Effective selection methods yield near-optimal policies with significantly fewer reward labels than full supervision, establishing reward selection as a powerful paradigm for scaling reinforcement learning in feedback-limited settings.
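The abstract's first class of strategies selects rewards using reward-free information such as state visitation. A minimal sketch of such a visitation-count heuristic is below; the function name, trajectory format, and budget interface are illustrative assumptions, not the paper's API:

```python
from collections import Counter

def select_by_visitation(trajectories, budget):
    """Heuristic reward selection: spend the labeling budget on the
    most-visited (state, action) pairs, on the assumption that
    frequently visited transitions lie along the trajectories the
    policy actually follows."""
    counts = Counter(sa for traj in trajectories for sa in traj)
    return [sa for sa, _ in counts.most_common(budget)]

# Toy rollout data: two trajectories in a 3-state chain environment.
trajs = [
    [(0, "right"), (1, "right"), (2, "stay")],
    [(0, "right"), (1, "left"), (0, "right")],
]
selected = select_by_visitation(trajs, budget=2)
```

Only the selected pairs would then be sent for (human or programmatic) reward labeling; the remaining transitions stay unlabeled.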
Related papers
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z) - A General Framework for Off-Policy Learning with Partially-Observed Reward [13.866986480307007]
Off-policy learning (OPL) in contextual bandits aims to learn a policy that maximizes the expected target reward. When rewards are only partially observed, the effectiveness of OPL degrades severely. We propose a new method called Hybrid Policy Optimization for Partially-Observed Reward (HyPeR).
arXiv Detail & Related papers (2025-06-17T11:58:11Z) - Information-Theoretic Reward Decomposition for Generalizable RLHF [51.550547285296794]
We decompose the reward value into two independent components: prompt-free reward and prompt-related reward. We propose a new reward learning algorithm that prioritizes data samples based on their prompt-free reward values.
arXiv Detail & Related papers (2025-04-08T13:26:07Z) - ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization [41.074747242532695]
Online Reward Selection and Policy Optimization (ORSO) is a novel approach that frames the selection of a shaping reward function as an online model selection problem. ORSO significantly reduces the amount of data required to evaluate a shaping reward function, resulting in superior data efficiency and a significant reduction in computational time (up to 8 times). ORSO consistently identifies high-quality reward functions, outperforming prior methods by more than 50%, and on average identifies policies as performant as those learned using reward functions manually engineered by domain experts.
arXiv Detail & Related papers (2024-10-17T17:55:05Z) - Hindsight PRIORs for Reward Learning from Human Preferences [3.4990427823966828]
Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors.
Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference.
We introduce a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance.
arXiv Detail & Related papers (2024-04-12T21:59:42Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Policy Optimization via Adv2: Adversarial Learning on Advantage Functions [6.793286055326244]
We revisit the reduction of learning in adversarial Markov decision processes (MDPs) to adversarial learning based on $Q$-values. We discuss the impact of this reduction in practical scenarios where the transition kernels are unknown.
arXiv Detail & Related papers (2023-10-25T08:53:51Z) - Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
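The SURF summary above describes inferring pseudo-labels for unlabeled samples from the confidence of the preference predictor. A minimal sketch of that confidence-thresholding step is below; the function name, logit interface, and threshold value are illustrative assumptions, not SURF's actual implementation:

```python
import math

def pseudo_label(predictor_logit, threshold=0.95):
    """Confidence-based pseudo-labeling for an unlabeled preference pair.

    `predictor_logit` is the preference predictor's logit for
    "segment A preferred over segment B". Returns 1.0 (A preferred)
    or 0.0 (B preferred) when the predictor is confident enough,
    and None when the pair is too uncertain to pseudo-label."""
    p = 1.0 / (1.0 + math.exp(-predictor_logit))  # sigmoid probability
    if p >= threshold:
        return 1.0
    if p <= 1.0 - threshold:
        return 0.0
    return None  # below the confidence threshold: leave unlabeled

labels = [pseudo_label(z) for z in (4.0, -4.0, 0.3)]  # confident, confident, uncertain
```

Pairs that receive a pseudo-label can then be added to the reward-learning batch alongside the genuinely labeled preferences.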
arXiv Detail & Related papers (2022-03-18T16:50:38Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z) - Self Punishment and Reward Backfill for Deep Q-Learning [6.572828651397661]
Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment.
In many environments, the reward is provided after a series of actions rather than after each individual action, leaving the agent uncertain about which of those actions were effective.
We propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward.
arXiv Detail & Related papers (2020-04-10T11:53:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all generated summaries) and is not responsible for any consequences of its use.