Efficient Policy Learning from Surrogate-Loss Classification Reductions
- URL: http://arxiv.org/abs/2002.05153v1
- Date: Wed, 12 Feb 2020 18:54:41 GMT
- Title: Efficient Policy Learning from Surrogate-Loss Classification Reductions
- Authors: Andrew Bennett and Nathan Kallus
- Abstract summary: We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
- Score: 65.91730154730905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work on policy learning from observational data has highlighted the
importance of efficient policy evaluation and has proposed reductions to
weighted (cost-sensitive) classification. But, efficient policy evaluation need
not yield efficient estimation of policy parameters. We consider the estimation
problem given by a weighted surrogate-loss classification reduction of policy
learning with any score function, either direct, inverse-propensity weighted,
or doubly robust. We show that, under a correct specification assumption, the
weighted classification formulation need not be efficient for policy
parameters. We draw a contrast to actual (possibly weighted) binary
classification, where correct specification implies a parametric model, while
for policy learning it only implies a semiparametric model. In light of this,
we instead propose an estimation approach based on generalized method of
moments, which is efficient for the policy parameters. We propose a particular
method based on recent developments on solving moment problems using neural
networks and demonstrate the efficiency and regret benefits of this method
empirically.
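To make the reduction concrete, below is a minimal, hypothetical sketch (not the authors' code) of the weighted surrogate-loss classification reduction with a doubly robust score for binary actions, using a logistic surrogate and a linear policy class; the comments note the surrogate first-order conditions that a GMM approach would instead treat as moment conditions. All names (dr_score, fit_policy), the oracle nuisances, and the toy data are illustrative assumptions.

```python
# Minimal sketch, assuming binary actions a in {-1, +1}, a linear policy
# pi_theta(x) = sign(x @ theta), and a logistic surrogate. Not the paper's method
# or code; function names and nuisance models are hypothetical.
import numpy as np

def dr_score(y, a, x, mu_hat, e_hat):
    """Doubly robust score psi for one sample.

    mu_hat(x, a): estimated outcome regression; e_hat(x, a): estimated propensity.
    psi > 0 suggests action +1 is preferable at x.
    """
    direct = mu_hat(x, +1) - mu_hat(x, -1)
    correction = a * (y - mu_hat(x, a)) / e_hat(x, a)
    return direct + correction

def fit_policy(x, psi, lr=0.1, n_iter=2000):
    """Weighted (cost-sensitive) surrogate-loss classification:
    minimize (1/n) * sum_i |psi_i| * log(1 + exp(-sign(psi_i) * x_i @ theta)).

    The abstract's point: this weighted classification estimate of theta need not
    be efficient. Its first-order conditions,
        E[ psi * sigmoid(-sign(psi) * x @ theta) * x ] = 0,
    can instead be used as moment conditions in a (suitably weighted) GMM
    procedure, which the paper argues is efficient for the policy parameters.
    """
    w, s = np.abs(psi), np.sign(psi)
    theta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        margin = s * (x @ theta)
        # gradient of the weighted logistic surrogate loss
        grad = -(w * s / (1.0 + np.exp(margin))) @ x / len(psi)
        theta -= lr * grad
    return theta

# Toy usage with oracle nuisances (purely illustrative):
rng = np.random.default_rng(0)
n, d = 2000, 3
x = rng.normal(size=(n, d))
a = rng.choice([-1, 1], size=n)              # randomized behavior policy
mu = lambda x_, a_: a_ * x_[..., 0]          # true outcome model
y = mu(x, a) + rng.normal(scale=0.5, size=n)
psi = np.array([dr_score(y[i], a[i], x[i], mu, lambda x_, a_: 0.5) for i in range(n)])
theta_hat = fit_policy(x, psi)
print(theta_hat)  # should roughly align with the first coordinate
```

In the GMM view suggested by the abstract, the same first-order conditions are treated as moment equalities and weighted for efficiency (e.g., with the neural-network moment-solving machinery the paper builds on), rather than obtained as the minimizer of the weighted surrogate loss itself.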
Related papers
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z) - Optimal Policy Adaptation under Covariate Shift [15.703626346971182]
We propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets.
We derive the identifiability assumptions for the reward induced by a given policy.
We then learn the optimal policy by optimizing the estimated reward.
arXiv Detail & Related papers (2025-01-14T12:33:02Z) - Multilinear Tensor Low-Rank Approximation for Policy-Gradient Methods in Reinforcement Learning [27.868175900131313]
Reinforcement learning (RL) aims to estimate the action to take given a (time-varying) state.
This paper postulates multi-linear mappings to efficiently estimate the parameters of the RL policy.
We leverage the PARAFAC decomposition to design tensor low-rank policies.
arXiv Detail & Related papers (2025-01-08T23:22:08Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and can learn from samples collected by other policies.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.