Counterfactual Learning of Stochastic Policies with Continuous Actions:
from Models to Offline Evaluation
- URL: http://arxiv.org/abs/2004.11722v5
- Date: Mon, 23 Aug 2021 08:39:15 GMT
- Title: Counterfactual Learning of Stochastic Policies with Continuous Actions:
from Models to Offline Evaluation
- Authors: Houssam Zenati, Alberto Bietti, Matthieu Martin, Eustache Diemert,
Julien Mairal
- Abstract summary: We introduce a modelling strategy based on a joint kernel embedding of contexts and actions.
We empirically show that the optimization aspect of counterfactual learning is important.
We propose an evaluation protocol for offline policies in real-world logged systems.
- Score: 41.21447375318793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Counterfactual reasoning from logged data has become increasingly important
for many applications such as web advertising or healthcare. In this paper, we
address the problem of learning stochastic policies with continuous actions
from the viewpoint of counterfactual risk minimization (CRM). While the CRM
framework is appealing and well studied for discrete actions, the continuous
action case raises new challenges about modelization, optimization, and offline
model selection with real data, which turns out to be particularly challenging.
Our paper contributes to these three aspects of the CRM estimation pipeline.
First, we introduce a modelling strategy based on a joint kernel embedding of
contexts and actions, which overcomes the shortcomings of previous
discretization approaches. Second, we empirically show that the optimization
aspect of counterfactual learning is important, and we demonstrate the benefits
of proximal point algorithms and differentiable estimators. Finally, we propose
an evaluation protocol for offline policies in real-world logged systems, which
is challenging since policies cannot be replayed on test data, and we release a
new large-scale dataset along with multiple synthetic, yet realistic,
evaluation setups.
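As a rough illustration of the first contribution, the sketch below parameterizes a stochastic Gaussian policy for a scalar continuous action whose mean is a kernel model over contexts, and estimates a clipped importance-sampling (IPS) counterfactual risk from logged data. The class and function names, the RBF kernel choice, and the clipping constant are illustrative assumptions, not the paper's exact joint kernel embedding or implementation.

```python
import numpy as np

def rbf_kernel(U, V, bandwidth=1.0):
    """RBF kernel matrix between rows of U (n, d) and rows of V (m, d)."""
    sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

class KernelGaussianPolicy:
    """Stochastic policy for a scalar continuous action: a ~ N(mu(x), sigma^2),
    where mu(x) = sum_j alpha_j k(x, x_j) is a kernel model over anchor contexts
    (an illustrative stand-in for a joint context-action kernel embedding)."""

    def __init__(self, anchor_contexts, alpha, sigma=0.5, bandwidth=1.0):
        self.anchor_contexts = anchor_contexts  # (m, d) anchor points
        self.alpha = alpha                      # (m,) coefficients (the learned parameters)
        self.sigma = sigma                      # exploration noise level
        self.bandwidth = bandwidth

    def mean(self, X):
        return rbf_kernel(X, self.anchor_contexts, self.bandwidth) @ self.alpha

    def density(self, X, a):
        """Evaluate pi(a_i | x_i) for each logged (context, action) pair."""
        z = (a - self.mean(X)) / self.sigma
        return np.exp(-0.5 * z ** 2) / (self.sigma * np.sqrt(2.0 * np.pi))

def clipped_ips_risk(policy, X, a_logged, cost, propensity, clip=10.0):
    """Counterfactual risk estimate from logged tuples (x_i, a_i, cost_i, pi0(a_i|x_i)),
    using clipped importance weights pi(a_i|x_i) / pi0(a_i|x_i)."""
    weights = np.minimum(policy.density(X, a_logged) / propensity, clip)
    return float(np.mean(weights * cost))
```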
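For the optimization aspect, the abstract highlights proximal point algorithms. The following is a minimal, generic proximal point scheme applied to such a counterfactual risk: each outer iterate approximately minimizes the risk plus a quadratic term anchored at the previous iterate. The finite-difference gradient and all step sizes are assumptions chosen to keep the sketch self-contained; the paper's actual solver may differ.

```python
import numpy as np

def proximal_point_minimize(risk_fn, theta0, kappa=1.0, outer_steps=20,
                            inner_steps=50, lr=0.05, eps=1e-5):
    """Generic proximal point scheme: at each outer step, approximately minimize
    risk(theta) + (kappa / 2) * ||theta - theta_prev||^2 with a few gradient steps.
    Gradients are central finite differences to avoid extra dependencies."""
    def numerical_grad(f, theta):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            step = np.zeros_like(theta)
            step[i] = eps
            grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
        return grad

    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(outer_steps):
        anchor = theta.copy()
        surrogate = lambda t: risk_fn(t) + 0.5 * kappa * np.sum((t - anchor) ** 2)
        for _ in range(inner_steps):
            theta = theta - lr * numerical_grad(surrogate, theta)
    return theta
```

Combined with the previous sketch, one could, for example, fit the kernel coefficients via `proximal_point_minimize(lambda a: clipped_ips_risk(KernelGaussianPolicy(anchors, a), X, actions, costs, props), alpha0)`.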
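Finally, because a new policy cannot be replayed on logged test data, offline evaluation typically relies on importance sampling against the logging policy. Below is a minimal sketch of a self-normalized estimator of this kind; the argument names and the choice of self-normalization are assumptions for illustration, not a description of the released dataset or the paper's exact protocol.

```python
import numpy as np

def snips_value(pi_new_density, contexts, actions, rewards, logging_propensities):
    """Self-normalized importance sampling estimate of a new policy's expected reward
    from logged data collected under a logging policy pi0, without replaying pi_new.
    `pi_new_density(contexts, actions)` must return pi_new(a_i | x_i) per sample."""
    weights = pi_new_density(contexts, actions) / logging_propensities
    return float(np.sum(weights * rewards) / np.sum(weights))
```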
Related papers
- Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL)
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
arXiv Detail & Related papers (2024-03-18T14:51:19Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Sequential Counterfactual Risk Minimization [37.600857571957754]
"Sequential Counterfactual Risk Minimization" is a framework for dealing with the logged bandit feedback problem.
We introduce a novel counterfactual estimator and identify conditions that can improve the performance of CRM.
arXiv Detail & Related papers (2023-02-23T15:59:30Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study the offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose several policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
- Revisiting Design Choices in Model-Based Offline Reinforcement Learning [39.01805509055988]
Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies.
This paper compares design choices and proposes novel protocols to investigate their interaction with other hyperparameters, such as the number of models or the imagined rollout horizon.
arXiv Detail & Related papers (2021-10-08T13:51:34Z)
- An Offline Risk-aware Policy Selection Method for Bayesian Markov Decision Processes [0.0]
Exploitation vs Caution (EvC) is a paradigm that elegantly incorporates model uncertainty while abiding by the Bayesian formalism.
We validate EvC against state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes.
In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners.
arXiv Detail & Related papers (2021-05-27T20:12:20Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- S^3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization [104.87483578308526]
We propose the model S^3-Rec, which stands for Self-Supervised learning for Sequential Recommendation.
For our task, we devise four auxiliary self-supervised objectives to learn the correlations among attribute, item, subsequence, and sequence.
Extensive experiments conducted on six real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art methods.
arXiv Detail & Related papers (2020-08-18T11:44:10Z)