A General Framework for Off-Policy Learning with Partially-Observed Reward
- URL: http://arxiv.org/abs/2506.14439v1
- Date: Tue, 17 Jun 2025 11:58:11 GMT
- Title: A General Framework for Off-Policy Learning with Partially-Observed Reward
- Authors: Rikiya Takehi, Masahiro Asami, Kosuke Kawakami, Yuta Saito
- Abstract summary: Off-policy learning (OPL) in contextual bandits aims to learn a policy that maximizes the expected target reward. When rewards are only partially observed, the effectiveness of OPL degrades severely. We propose a new method called Hybrid Policy Optimization for Partially-Observed Reward (HyPeR).
- Score: 13.866986480307007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning (OPL) in contextual bandits aims to learn a decision-making policy that maximizes the target rewards by using only historical interaction data collected under previously developed policies. Unfortunately, when rewards are only partially observed, the effectiveness of OPL degrades severely. Well-known examples of such partial rewards include explicit ratings in content recommendations, conversion signals on e-commerce platforms that are partial due to delay, and the issue of censoring in medical problems. One possible solution to deal with such partial rewards is to use secondary rewards, such as dwelling time, clicks, and medical indicators, which are more densely observed. However, relying solely on such secondary rewards can also lead to poor policy learning since they may not align with the target reward. Thus, this work studies a new and general problem of OPL where the goal is to learn a policy that maximizes the expected target reward by leveraging densely observed secondary rewards as supplemental data. We then propose a new method called Hybrid Policy Optimization for Partially-Observed Reward (HyPeR), which effectively uses the secondary rewards in addition to the partially-observed target reward to achieve effective OPL despite the challenging scenario. We also discuss a case where we aim to optimize not only the expected target reward but also the expected secondary rewards to some extent; counter-intuitively, we will show that leveraging the two objectives is in fact advantageous also for the optimization of only the target reward. Along with statistical analysis of our proposed methods, empirical evaluations on both synthetic and real-world data show that HyPeR outperforms existing methods in various scenarios.
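The abstract describes the problem setting only in prose, so a minimal sketch may help make it concrete. The snippet below is not the paper's HyPeR method; it is a generic doubly-robust-style off-policy value estimate in which a regression model imputes the target reward from the densely observed secondary reward, and importance weighting on the rounds where the target reward was actually observed corrects the model's bias. All names (`pi_e`, `pi_0`, `reg_model`) and the missing-completely-at-random assumption on reward observation are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative sketch only (not the paper's HyPeR estimator).
# Hypothetical logged data, one entry per round:
#   contexts, actions : what the logging policy pi_0 did
#   observed          : 1 if the target reward was observed, else 0
#   rewards           : target reward (meaningful only where observed == 1)
#   secondary         : densely observed secondary reward (e.g., clicks)

def hybrid_value_estimate(pi_e, pi_0, contexts, actions,
                          observed, rewards, secondary, reg_model):
    """Doubly-robust-style estimate of the target policy value.

    pi_e(contexts, actions) and pi_0(contexts, actions) return action
    probabilities under the evaluation and logging policies; reg_model
    predicts the target reward from (context, action, secondary reward).
    For simplicity this sketch assumes the target reward is missing
    completely at random.
    """
    w = pi_e(contexts, actions) / pi_0(contexts, actions)   # importance weights
    r_hat = reg_model(contexts, actions, secondary)          # imputed target reward
    direct = np.mean(w * r_hat)                              # model-based term

    mask = observed.astype(float)
    r_obs = np.where(observed == 1, rewards, 0.0)            # zero out missing rewards
    # Bias correction computed only on rounds where the target reward is known.
    correction = np.sum(mask * w * (r_obs - r_hat)) / max(mask.sum(), 1.0)
    return direct + correction
```

When reward observation is not random (e.g., the delayed conversions or censoring mentioned in the abstract), the correction term would additionally need an observation-probability model; the sketch above omits this.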
Related papers
- Reward-Conditioned Reinforcement Learning [56.417273471201845]
We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data, entirely off-policy. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
arXiv Detail & Related papers (2026-03-05T11:29:17Z) - GOPO: Policy Optimization using Ranked Rewards [12.100854296428524]
Group Ordinal Policy Optimization (GOPO) uses only the ranking of the rewards and discards their magnitudes. We demonstrate consistent improvements across a range of tasks and model sizes.
arXiv Detail & Related papers (2026-02-01T22:07:11Z) - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents [28.145430029174577]
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments. Existing approaches typically rely on outcome-based rewards that are only provided at the final answer. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
arXiv Detail & Related papers (2025-10-16T17:59:32Z) - Which Rewards Matter? Reward Selection for Reinforcement Learning under Limited Feedback [16.699326038073856]
We study the problem of reward selection for reinforcement learning from limited feedback. We find that critical subsets of rewards are those that guide the agent along optimal trajectories. We also find that effective selection methods yield near-optimal policies with significantly fewer reward labels than full supervision.
arXiv Detail & Related papers (2025-09-30T18:17:49Z) - Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z) - TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization [73.16975077770765]
Recent advancements in reinforcement learning have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO). However, it is challenging to leverage such token-level rewards as guidance for Direct Preference Optimization (DPO). This work decomposes PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance.
arXiv Detail & Related papers (2025-06-17T14:30:06Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2x and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - Information-Theoretic Reward Decomposition for Generalizable RLHF [51.550547285296794]
We decompose the reward value into two independent components: prompt-free reward and prompt-related reward. We propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values.
arXiv Detail & Related papers (2025-04-08T13:26:07Z) - Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs). Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, yet this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive. We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards.
arXiv Detail & Related papers (2025-02-03T15:43:48Z) - Hindsight PRIORs for Reward Learning from Human Preferences [3.4990427823966828]
Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors.
Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference.
We introduce a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance.
arXiv Detail & Related papers (2024-04-12T21:59:42Z) - Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z) - Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, designing immediate reward signals is difficult.
We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
arXiv Detail & Related papers (2024-02-06T07:26:44Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z) - Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards [15.082715993594121]
We investigate the problem of learning a policy that treats its users equitably.
In this paper, we formulate this novel RL problem, in which an objective function that encodes a notion of fairness is optimized.
We describe how several classic deep RL algorithms can be adapted to our fair optimization problem.
arXiv Detail & Related papers (2020-08-18T07:17:53Z)