Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
- URL: http://arxiv.org/abs/2402.14664v1
- Date: Thu, 22 Feb 2024 16:09:45 GMT
- Title: Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
- Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
- Abstract summary: In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation and learning.
We introduce a unified Bayesian framework to capture these correlations through structured and informative priors.
We propose sDM, a generic Bayesian approach for OPE and OPL, grounded in both algorithmic and theoretical foundations.
- Score: 14.203316003782604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In interactive systems, actions are often correlated, presenting an
opportunity for more sample-efficient off-policy evaluation (OPE) and learning
(OPL) in large action spaces. We introduce a unified Bayesian framework to
capture these correlations through structured and informative priors. In this
framework, we propose sDM, a generic Bayesian approach designed for OPE and
OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM
leverages action correlations without compromising computational efficiency.
Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics
that assess the average performance of algorithms across multiple problem
instances, deviating from the conventional worst-case assessments. We analyze
sDM in OPE and OPL, highlighting the benefits of leveraging action
correlations. Empirical evidence showcases the strong performance of sDM.
Related papers
- Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal Interferences [24.361550505778155]
Off-policy evaluation (OPE) is widely applied in sectors such as pharmaceuticals and e-commerce.
This paper introduces a causal deepset framework that relaxes several key structural assumptions.
We present novel algorithms that incorporate the permutation invariance (PI) assumption into OPE and thoroughly examine their theoretical foundations.
arXiv Detail & Related papers (2024-07-25T10:02:11Z)
- ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization [52.5587113539404]
We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impacts for efficient exploration.
Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage across 29 diverse continuous control tasks.
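One plausible reading of such a term, offered here only as an assumption rather than the paper's definition, is an entropy bonus in which each action's contribution is scaled by an estimated causal-impact weight:

```python
import numpy as np

def causality_aware_entropy(policy_probs, causal_weights):
    """Illustrative sketch only: a weighted entropy term in which
    actions with larger estimated causal impact (causal_weights)
    receive a larger exploration bonus. Not ACE's exact definition.
    """
    p = np.clip(policy_probs, 1e-12, 1.0)
    return float(-(causal_weights * p * np.log(p)).sum())
```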
arXiv Detail & Related papers (2024-02-22T13:22:06Z)
- Benchmarking Bayesian Causal Discovery Methods for Downstream Treatment Effect Estimation [137.3520153445413]
A notable gap exists in the evaluation of causal discovery methods, where insufficient emphasis is placed on downstream inference.
We evaluate seven established baseline causal discovery methods including a newly proposed method based on GFlowNets.
The results of our study demonstrate that some of the algorithms studied are able to effectively capture a wide range of useful and diverse ATE modes.
arXiv Detail & Related papers (2023-07-11T02:58:10Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.81067017094468]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- Context-Aware Bayesian Network Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning [7.784991832712813]
We introduce a Bayesian network to capture correlations between agents' action selections in their joint policy.
We develop practical algorithms to learn the context-aware Bayesian network policies.
Empirical results on a range of MARL benchmarks show the benefits of our approach.
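A minimal sketch of what a Bayesian-network-factorized joint policy could look like, assuming a fixed DAG over agents and hypothetical per-agent conditional policies (the names below are ours, not the paper's):

```python
import numpy as np

def sample_joint_action(state, agent_order, parents, cond_policies, rng):
    """Sketch under stated assumptions: the joint policy factorizes
    along a DAG over agents, so agent i samples from
    pi_i(a_i | state, actions of its parents). cond_policies[i] is a
    hypothetical callable returning a probability vector over actions.
    """
    actions = {}
    for i in agent_order:  # a topological order of the DAG
        parent_actions = tuple(actions[j] for j in parents[i])
        probs = cond_policies[i](state, parent_actions)
        actions[i] = int(rng.choice(len(probs), p=probs))
    return actions
```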
arXiv Detail & Related papers (2023-06-02T21:22:27Z)
- Counterfactual Learning with Multioutput Deep Kernels [0.0]
In this paper, we address the challenge of performing counterfactual inference with observational data.
We present a general class of counterfactual multi-task deep kernel models that estimate causal effects and learn policies proficiently.
arXiv Detail & Related papers (2022-11-20T23:28:41Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
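The core reweighting step can be sketched as follows, assuming the marginal probability of each logged action embedding under the target and behavior policies has already been computed (this simplified interface is an assumption, not the paper's API):

```python
import numpy as np

def marginalized_ips(rewards, emb_probs_target, emb_probs_behavior):
    """Sketch of a marginalized importance weighting estimator: each
    logged reward is reweighted by the ratio of its embedding's
    marginal probability under the target vs. behavior policy.
    """
    weights = emb_probs_target / np.clip(emb_probs_behavior, 1e-12, None)
    return float(np.mean(weights * rewards))
```

Because many actions can share an embedding, these marginal ratios are typically far better behaved than per-action importance weights when the action space is large.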
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine the critic-estimated action values to control the variance of the gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
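One standard variance-control device consistent with this description, sketched here under our own assumptions rather than as the paper's exact estimator, is to compute the gradient as an exact expectation over the discrete action set, weighting each action's score function by its critic-estimated value:

```python
import numpy as np

def expected_policy_gradient(policy_probs, grad_log_probs, q_values):
    """Sketch: discrete-action policy gradient summed over all K actions.

    policy_probs:   (K,)   pi(a|s) for each action
    grad_log_probs: (K, D) gradient of log pi(a|s) per action
    q_values:       (K,)   critic-estimated Q(s, a)

    Summing over actions instead of sampling one removes that source
    of variance from the gradient estimate.
    """
    return (policy_probs[:, None] * grad_log_probs * q_values[:, None]).sum(axis=0)
```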
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.