$K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control
- URL: http://arxiv.org/abs/2306.04836v2
- Date: Wed, 10 Jan 2024 09:19:05 GMT
- Title: $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control
- Authors: Michael Giegrich, Roel Oomen, Christoph Reisinger
- Abstract summary: We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
- Score: 0.6906005491572401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel $K$-nearest neighbor resampling procedure
for estimating the performance of a policy from historical data containing
realized episodes of a decision process generated under a different policy. We
provide statistical consistency results under weak conditions. In particular,
we avoid the common assumption of identically and independently distributed
transitions and rewards. Instead, our analysis allows for the sampling of
entire episodes, as is common practice in most applications. To establish the
consistency in this setting, we generalize Stone's Theorem, a well-known result
in nonparametric statistics on local averaging, to include episodic data and
the counterfactual estimation underlying off-policy evaluation (OPE). By
focusing on feedback policies that depend deterministically on the current
state in environments with continuous state-action spaces and system-inherent
stochasticity effected by chosen actions, and relying on trajectory simulation
similar to Monte Carlo methods, the proposed method is particularly well suited
for stochastic control environments. Compared to other OPE methods, our
algorithm does not require optimization, can be efficiently implemented via
tree-based nearest neighbor search and parallelization, and does not explicitly
assume a parametric model for the environment's dynamics. Numerical experiments
demonstrate the effectiveness of the algorithm compared to existing baselines
in a variety of stochastic control settings, including a linear quadratic
regulator, trade execution in limit order books, and online stochastic bin
packing.
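To make the resampling idea concrete, here is a minimal sketch in the spirit of the abstract: logged transitions are indexed with a tree-based nearest-neighbor structure over (state, action) pairs, and trajectories under the target policy are simulated, Monte Carlo style, by reusing the realized rewards and next states of nearby logged transitions. The data layout and function names are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of K-nearest-neighbor resampling for off-policy evaluation.
# Assumed data layout: transitions pooled across historical episodes, with
# 2-D arrays states, actions and rows rewards[i], next_states[i].
import numpy as np
from scipy.spatial import cKDTree

def knn_resampling_value(states, actions, rewards, next_states,
                         policy, s0, horizon, k=10, n_rollouts=1000, seed=0):
    """Estimate the value of a deterministic feedback policy at start state s0
    by resampling logged transitions whose (state, action) pair is close to
    the pair the target policy would generate (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Tree-based nearest-neighbor search over joint (state, action) features,
    # as mentioned in the abstract; feature scaling is left to the caller.
    tree = cKDTree(np.hstack([states, actions]))
    returns = np.empty(n_rollouts)
    for i in range(n_rollouts):
        s, total = np.asarray(s0, dtype=float), 0.0
        for _ in range(horizon):
            a = np.atleast_1d(policy(s))             # target action at s
            _, idx = tree.query(np.concatenate([s, a]), k=k)
            j = int(rng.choice(np.atleast_1d(idx)))  # resample one neighbor
            total += rewards[j]                      # reuse its realized reward
            s = np.asarray(next_states[j], float)    # ...and its transition
        returns[i] = total
    return returns.mean()
```
As the abstract notes, nothing here requires an optimization step or a parametric model of the dynamics, and the rollouts are independent and therefore straightforward to parallelize.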
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
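For context on the re-weighting mentioned in the summary above, here is a hedged sketch of standard per-episode importance sampling in a REINFORCE-style gradient estimate; the paper's point is to choose the behavior policy actively so as to reduce the variance of exactly this kind of estimator, whereas the sketch (whose function arguments are illustrative assumptions) treats the behavior policy as given.
```python
# Sketch: reusing episodes collected under a behavior policy in a
# REINFORCE-style gradient estimate via importance sampling (IS).
# (Illustrative interface; not the paper's active-IS scheme.)
import numpy as np

def is_reinforce_gradient(episodes, logp_target, logp_behavior, grad_logp_target):
    """episodes: iterable of (states, actions, rewards) arrays per episode.
    logp_*(states, actions): per-step log pi(a_t | s_t) under each policy.
    grad_logp_target(states, actions): per-step score function of the target
    policy, shape (T, n_params)."""
    grads = []
    for states, actions, rewards in episodes:
        # Per-episode importance weight: product over steps of the density
        # ratio pi_target / pi_behavior, accumulated in log space.
        log_w = np.sum(logp_target(states, actions)
                       - logp_behavior(states, actions))
        score = np.sum(grad_logp_target(states, actions), axis=0)
        grads.append(np.exp(log_w) * np.sum(rewards) * score)
    # Higher dispersion of the weights means higher gradient variance, which
    # is what an actively chosen behavior policy would aim to reduce.
    return np.mean(np.stack(grads), axis=0)
```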
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks.
Specifically, when the stationary distribution of the MDP is parametrized by the policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z)
- Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits [18.982448033389588]
Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment.
To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed.
We propose a novel DRO approach that employs the Wasserstein distance instead.
arXiv Detail & Related papers (2023-09-15T20:21:46Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
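As a rough illustration of the dependence issue in the summary above, below is a sketch of an inverse-propensity-weighted value estimate for adaptively collected bandit data: each round is weighted using the behavior policy as it stood at that round, so the centered summands form a martingale difference sequence even though the samples are not i.i.d. The interface is an assumption for illustration, not the paper's exact estimator.
```python
# Sketch: inverse-propensity-weighted OPE for a contextual bandit whose
# behavior policy was updated in batches (illustrative interface only).
import numpy as np

def adaptive_ipw_value(contexts, actions, rewards, behavior_probs, target_prob):
    """behavior_probs[t] = pi_{b,t}(a_t | x_t): probability the behavior
    policy, as updated through round t, assigned to the logged action.
    target_prob(x, a) = pi_e(a | x) for the evaluation policy.
    Each summand uses only the round-t propensity, so the centered terms
    form a martingale difference sequence despite the dependence."""
    w = np.array([target_prob(x, a) for x, a in zip(contexts, actions)])
    w /= np.asarray(behavior_probs)
    return float(np.mean(w * np.asarray(rewards)))
```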
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide Markov-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
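To make the value-tracking idea concrete, here is a hedged sketch of a generic Kalman-filter update for linear value-function weights, treating a sampled return target as a noisy observation of $\phi(s)^\top w$. This illustrates the Kalman-style tracking perspective in general, not KOVA's specific regularized objective.
```python
# Sketch: Kalman-style tracking of linear value-function weights w, where an
# observed return target is treated as a noisy measurement of phi(s)^T w.
# (Generic Kalman-TD-style update for illustration; not the KOVA algorithm.)
import numpy as np

class KalmanValueTracker:
    def __init__(self, dim, process_var=1e-4, obs_var=1.0):
        self.w = np.zeros(dim)               # value weights (filter mean)
        self.P = np.eye(dim)                 # parameter uncertainty (covariance)
        self.Q = process_var * np.eye(dim)   # random-walk drift of the weights
        self.R = obs_var                     # noise in the observed return target

    def update(self, phi, target):
        """phi: feature vector of the visited state; target: sampled return
        (or bootstrapped TD target) observed for that state."""
        self.P = self.P + self.Q             # predict: weights may drift
        S = phi @ self.P @ phi + self.R      # innovation variance
        K = (self.P @ phi) / S               # Kalman gain
        self.w = self.w + K * (target - phi @ self.w)
        self.P = self.P - np.outer(K, phi @ self.P)
        return self.w
```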