Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
- URL: http://arxiv.org/abs/2402.14664v1
- Date: Thu, 22 Feb 2024 16:09:45 GMT
- Title: Bayesian Off-Policy Evaluation and Learning for Large Action Spaces
- Authors: Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
- Abstract summary: In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation and learning.
We introduce a unified Bayesian framework to capture these correlations through structured and informative priors.
We propose sDM, a generic Bayesian approach for OPE and OPL, grounded in both algorithmic and theoretical foundations.
- Score: 14.203316003782604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In interactive systems, actions are often correlated, presenting an
opportunity for more sample-efficient off-policy evaluation (OPE) and learning
(OPL) in large action spaces. We introduce a unified Bayesian framework to
capture these correlations through structured and informative priors. In this
framework, we propose sDM, a generic Bayesian approach designed for OPE and
OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM
leverages action correlations without compromising computational efficiency.
Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics
that assess the average performance of algorithms across multiple problem
instances, deviating from the conventional worst-case assessments. We analyze
sDM in OPE and OPL, highlighting the benefits of leveraging action
correlations. Empirical evidence showcases the strong performance of sDM.
Related papers
- Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal Interferences [24.361550505778155]
Off-policy evaluation (OPE) is widely applied in sectors such as pharmaceuticals and e-commerce.
This paper introduces a causal deepset framework that relaxes several key structural assumptions.
We present novel algorithms that incorporate the permutation invariance (PI) assumption into OPE and thoroughly examine their theoretical foundations.
arXiv Detail & Related papers (2024-07-25T10:02:11Z)
- ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization [52.5587113539404]
We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impacts for efficient exploration.
Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage across 29 diverse continuous control tasks.
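One plausible reading of such a term, offered here only as an assumption rather than the paper's definition, is an entropy bonus in which each action's contribution is scaled by an estimated causal-impact weight:

```python
import numpy as np

def causality_aware_entropy(policy_probs, causal_weights):
    """Illustrative sketch only: a weighted entropy term in which
    actions with larger estimated causal impact (causal_weights)
    receive a larger exploration bonus. Not ACE's exact definition.
    """
    p = np.clip(policy_probs, 1e-12, 1.0)
    return float(-(causal_weights * p * np.log(p)).sum())
```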
arXiv Detail & Related papers (2024-02-22T13:22:06Z)
- Benchmarking Bayesian Causal Discovery Methods for Downstream Treatment Effect Estimation [137.3520153445413]
A notable gap exists in the evaluation of causal discovery methods, where insufficient emphasis is placed on downstream inference.
We evaluate seven established baseline causal discovery methods including a newly proposed method based on GFlowNets.
The results of our study demonstrate that some of the algorithms studied are able to effectively capture a wide range of useful and diverse ATE modes.
arXiv Detail & Related papers (2023-07-11T02:58:10Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.81067017094468]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- Context-Aware Bayesian Network Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning [7.784991832712813]
We introduce a Bayesian network to capture correlations between agents' action selections in their joint policy.
We develop practical algorithms to learn the context-aware Bayesian network policies.
Empirical results on a range of MARL benchmarks show the benefits of our approach.
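A minimal sketch of what a Bayesian-network-factorized joint policy could look like, assuming a fixed DAG over agents and hypothetical per-agent conditional policies (the names below are ours, not the paper's):

```python
import numpy as np

def sample_joint_action(state, agent_order, parents, cond_policies, rng):
    """Sketch under stated assumptions: the joint policy factorizes
    along a DAG over agents, so agent i samples from
    pi_i(a_i | state, actions of its parents). cond_policies[i] is a
    hypothetical callable returning a probability vector over actions.
    """
    actions = {}
    for i in agent_order:  # a topological order of the DAG
        parent_actions = tuple(actions[j] for j in parents[i])
        probs = cond_policies[i](state, parent_actions)
        actions[i] = int(rng.choice(len(probs), p=probs))
    return actions
```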
arXiv Detail & Related papers (2023-06-02T21:22:27Z)
- Counterfactual Learning with Multioutput Deep Kernels [0.0]
In this paper, we address the challenge of performing counterfactual inference with observational data.
We present a general class of counterfactual multi-task deep kernel models that estimate causal effects and learn policies proficiently.
arXiv Detail & Related papers (2022-11-20T23:28:41Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
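The core reweighting step can be sketched as follows, assuming the marginal probability of each logged action embedding under the target and behavior policies has already been computed (this simplified interface is an assumption, not the paper's API):

```python
import numpy as np

def marginalized_ips(rewards, emb_probs_target, emb_probs_behavior):
    """Sketch of a marginalized importance weighting estimator: each
    logged reward is reweighted by the ratio of its embedding's
    marginal probability under the target vs. behavior policy.
    """
    weights = emb_probs_target / np.clip(emb_probs_behavior, 1e-12, None)
    return float(np.mean(weights * rewards))
```

Because many actions can share an embedding, these marginal ratios are typically far better behaved than per-action importance weights when the action space is large.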
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine the critic-estimated action values to control the variance of the gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
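One standard variance-control device consistent with this description, sketched here under our own assumptions rather than as the paper's exact estimator, is to compute the gradient as an exact expectation over the discrete action set, weighting each action's score function by its critic-estimated value:

```python
import numpy as np

def expected_policy_gradient(policy_probs, grad_log_probs, q_values):
    """Sketch: discrete-action policy gradient summed over all K actions.

    policy_probs:   (K,)   pi(a|s) for each action
    grad_log_probs: (K, D) gradient of log pi(a|s) per action
    q_values:       (K,)   critic-estimated Q(s, a)

    Summing over actions instead of sampling one removes that source
    of variance from the gradient estimate.
    """
    return (policy_probs[:, None] * grad_log_probs * q_values[:, None]).sum(axis=0)
```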
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.