Related papers: Percentile Criterion Optimization in Offline Reinforcement Learning

Percentile Criterion Optimization in Offline Reinforcement Learning

URL: http://arxiv.org/abs/2404.05055v1
Date: Sun, 7 Apr 2024 19:29:09 GMT
Title: Percentile Criterion Optimization in Offline Reinforcement Learning
Authors: Elita A. Lobo, Cyrus Cousins, Yair Zick, Marek Petrik,
Abstract summary: We propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity. Our theoretical and empirical results show that our results implicitly learn robust conservative policies.
Score: 22.42041973113997
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the \emph{percentile criterion}. The percentile criterion is approximately solved by constructing an \emph{ambiguity set} that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses \emph{Bayesian credible regions} as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.

Related papers

Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values [11.174902793218834]
We study non-rectangular robust Markov decision processes under the average-reward criterion.<n>We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal.
arXiv Detail & Related papers (2026-03-01T06:29:34Z)
Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting.<n>The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z)
Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks [7.701333337093469]
We consider a robust satisficing objective, where the goal is to consistently achieve a predefined performance threshold $tau$, even under adversarial conditions.<n>We propose two novel algorithms based on distinct formulations of robust satisficing, and show that they are instances of a general robust satisficing framework.<n>Specifically, we derive two regret bounds: one that is sublinear over time, assuming certain conditions on the adversary and the satisficing threshold $tau$, and another that scales with the perturbation magnitude but requires no assumptions on the adversary.
arXiv Detail & Related papers (2025-06-02T13:04:18Z)
Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level) We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z)
A Convex Framework for Confounding Robust Inference [21.918894096307294]
We study policy evaluation of offline contextual bandits subject to unobserved confounders. We propose a general estimator that provides a sharp lower bound of the policy value using convex programming.
arXiv Detail & Related papers (2023-09-21T19:45:37Z)
Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. We propose Pessimistic Policy Learning (PPL), a new algorithm that optimize lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a textitpolicy-dependent linear optimization response. We construct unbiased estimators for the policy-dependent estimand by a perturbation method. We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
Soft-Robust Algorithms for Batch Reinforcement Learning [36.78967245470449]
In reinforcement learning, robust decision-making problems with limited data are usually computed by the percentile criterion. We show that the percentile criterion is non- theoretical as it is difficult to optimize and ignores the mean performance. We propose and analyze two algorithms to approximately optimize the percentile criterion.
arXiv Detail & Related papers (2020-11-30T01:36:16Z)
Data-Driven Robust Optimization using Unsupervised Deep Learning [0.0]
We show that a trained neural network can be integrated into a robust optimization model by formulating the adversarial problem as a convex mixed-integer program. We find that this approach outperforms a similar approach using kernel-based support vector sets.
arXiv Detail & Related papers (2020-11-19T11:06:54Z)
Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces. We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation. We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z)
Progressive Identification of True Labels for Partial-Label Learning [112.94467491335611]
Partial-label learning (PLL) is a typical weakly supervised learning problem, where each training instance is equipped with a set of candidate labels among which only one is the true label. Most existing methods elaborately designed as constrained optimizations that must be solved in specific manners, making their computational complexity a bottleneck for scaling up to big data. This paper proposes a novel framework of classifier with flexibility on the model and optimization algorithm.
arXiv Detail & Related papers (2020-02-19T08:35:15Z)
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off- evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning such as education healthcare. We develop an approach that estimates the bounds of a given policy. We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.