Handling Cost and Constraints with Off-Policy Deep Reinforcement
Learning
- URL: http://arxiv.org/abs/2311.18684v1
- Date: Thu, 30 Nov 2023 16:31:04 GMT
- Title: Handling Cost and Constraints with Off-Policy Deep Reinforcement
Learning
- Authors: Jared Markowitz, Jesse Silverberg, Gary Collins
- Abstract summary: The most popular methods for off-policy learning include policy improvement steps in which a learned state-action ($Q$) value function is maximized over selected batches of data.
We revisit this strategy in environments with "mixed-sign" reward functions.
We find that an alternative approach that does not explicitly maximize $Q$ in the policy update, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting.
- Score: 2.793095554369282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: By reusing data throughout training, off-policy deep reinforcement learning
algorithms offer improved sample efficiency relative to on-policy approaches.
For continuous action spaces, the most popular methods for off-policy learning
include policy improvement steps where a learned state-action ($Q$) value
function is maximized over selected batches of data. These updates are often
paired with regularization to combat associated overestimation of $Q$ values.
With an eye toward safety, we revisit this strategy in environments with
"mixed-sign" reward functions; that is, with reward functions that include
independent positive (incentive) and negative (cost) terms. This setting is
common in real-world applications, and may be addressed with or without
constraints on the cost terms. We find the combination of function
approximation and a term that maximizes $Q$ in the policy update to be
problematic in such environments, because systematic errors in value estimation
impact the contributions from the competing terms asymmetrically. This results
in overemphasis of either incentives or costs and may severely limit learning.
We explore two remedies to this issue. First, consistent with prior work, we
find that periodic resetting of $Q$ and policy networks can be used to reduce
value estimation error and improve learning in this setting. Second, we
formulate novel off-policy actor-critic methods for both unconstrained and
constrained learning that do not explicitly maximize $Q$ in the policy update.
We find that this second approach, when applied to continuous action spaces
with mixed-sign rewards, consistently and significantly outperforms
state-of-the-art methods augmented by resetting. We further find that our
approach produces agents that are both competitive with popular methods overall
and more reliably competent on frequently-studied control problems that do not
have mixed-sign rewards.
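To make the contrast concrete, the sketch below juxtaposes a conventional $Q$-maximizing actor update with one illustrative alternative that avoids explicitly maximizing $Q$, together with a periodic-reset helper. The network shapes, the advantage-weighted update, and the reset schedule are assumptions chosen for illustration; the abstract does not specify the paper's exact update rule.

```python
# Minimal sketch contrasting the two policy-update styles discussed above.
# Network sizes, the advantage-weighted alternative, and the reset schedule
# are illustrative assumptions, not the paper's exact algorithm.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))


def q_maximizing_actor_loss(obs):
    """Conventional update (DDPG/SAC-style): ascend the learned Q through the
    policy's own action. With mixed-sign rewards, systematic Q errors can bias
    this step toward whichever term (incentive or cost) is over-estimated."""
    actions = torch.tanh(policy(obs))
    return -q_net(torch.cat([obs, actions], dim=-1)).mean()


def likelihood_based_actor_loss(obs, batch_actions, advantages, beta=1.0):
    """One example of an update that avoids explicitly maximizing Q: weight a
    log-likelihood term on batch actions by exp(advantage / beta), in the
    spirit of advantage-weighted regression (an illustrative stand-in, not
    the paper's exact policy update)."""
    mean = torch.tanh(policy(obs))
    log_prob = -0.5 * ((batch_actions - mean) ** 2).sum(dim=-1)  # fixed-variance Gaussian
    weights = torch.exp(torch.clamp(advantages / beta, max=5.0)).detach()
    return -(weights * log_prob).mean()


def maybe_reset(step, reset_period=200_000):
    """First remedy discussed above: periodically re-initialize the Q and
    policy networks (keeping the replay buffer) to shed accumulated value error."""
    if step > 0 and step % reset_period == 0:
        for net in (policy, q_net):
            for layer in net:
                if isinstance(layer, nn.Linear):
                    layer.reset_parameters()
```

In this sketch the gradient of the first loss flows through `q_net`, so any asymmetric estimation error directly shapes the policy; the second loss only consults the value estimate through detached weights, which is one way to decouple policy improvement from explicit $Q$ maximization.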
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - $\Delta\text{-OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-OPE}$ methods based on the widely used Inverse Propensity Scoring estimator.
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks; a simplified sketch of the pairwise IPS idea appears after this list.
arXiv Detail & Related papers (2024-05-16T12:04:55Z) - Off-Policy Primal-Dual Safe Reinforcement Learning [16.918188277722503]
We show that the error in cumulative cost estimation causes significant underestimation of cost when using off-policy methods.
We propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in estimation.
We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty.
arXiv Detail & Related papers (2024-01-26T10:33:38Z) - Bayesian Inverse Transition Learning for Offline Settings [30.10905852013852]
Reinforcement learning is commonly used for sequential decision-making in domains such as healthcare and education.
We propose a new constraint-based approach that captures our desiderata for reliably learning a posterior distribution of the transition dynamics $T$.
Our results demonstrate that by using our constraints, we learn a high-performing policy, while considerably reducing the policy's variance over different datasets.
arXiv Detail & Related papers (2023-08-09T17:08:29Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce *anti-concentrated* confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Batch Reinforcement Learning with a Nonparametric Off-Policy Policy
Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z) - Average Reward Adjusted Discounted Reinforcement Learning:
Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights into the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
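For the $\Delta\text{-OPE}$ entry above, the following sketch shows the basic idea of estimating a value *difference* between two target policies with Inverse Propensity Scoring on shared logged data; the function name, inputs, and the simple standard-error computation are assumptions for illustration rather than the paper's estimator.

```python
# Sketch of an IPS-based estimate of V(pi_a) - V(pi_b) from shared logged data.
# Names and the simple standard-error estimate are illustrative assumptions.
import numpy as np


def ips_value_difference(rewards, logging_probs, target_probs_a, target_probs_b):
    """Estimate the value difference between two target policies from logged
    (action, reward) pairs collected under a known logging policy.

    rewards         -- observed rewards for the logged actions, shape (N,)
    logging_probs   -- logging-policy probabilities of the logged actions
    target_probs_a  -- probabilities policy A assigns to the same actions
    target_probs_b  -- probabilities policy B assigns to the same actions
    """
    rewards = np.asarray(rewards, dtype=float)
    w_a = np.asarray(target_probs_a, dtype=float) / np.asarray(logging_probs, dtype=float)
    w_b = np.asarray(target_probs_b, dtype=float) / np.asarray(logging_probs, dtype=float)
    # Both terms reuse the same logged rewards and propensities, so their errors
    # are correlated; that correlation is what can make the paired (difference)
    # estimate lower-variance than differencing two independent IPS estimates.
    per_sample = (w_a - w_b) * rewards
    return per_sample.mean(), per_sample.std(ddof=1) / np.sqrt(len(per_sample))
```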