Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits
- URL: http://arxiv.org/abs/2509.00648v2
- Date: Tue, 14 Oct 2025 17:40:50 GMT
- Title: Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits
- Authors: Kushagra Chandak, Vincent Liu, Haanvid Lee,
- Abstract summary: Inverse Propensity Score (IPS) weighting suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. We introduce Context-Action Embedding Learning for MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator.
- Score: 3.5219188193742563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider off-policy evaluation (OPE) in contextual bandits with a finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE because it is unbiased, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings do not minimize the mean squared error (MSE) of the estimators and do not consider context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on the theoretical analysis of bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.
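For orientation, here is a minimal sketch of the two estimators the abstract contrasts, in standard OPE notation that we assume here (logged tuples $(x_i, a_i, r_i)$ collected by a logging policy $\pi_0$, target policy $\pi$, action embedding $e_i$): $\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i$ versus $\hat{V}_{\mathrm{MIPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\, r_i$. MIPS replaces the per-action weight with a marginal weight over embeddings, so actions with similar embeddings share statistical strength; CAEL-MIPS, per the abstract, additionally learns the embedding map as a function of both context and action so that this weight minimizes the estimator's MSE.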
Related papers
- In-Context Probing for Membership Inference in Fine-Tuned Language Models [14.590625376049955]
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs).
We propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics.
ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates.
arXiv Detail & Related papers (2025-12-18T08:26:26Z)
- SoK: Data Minimization in Machine Learning [49.60064304454055]
Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task.
The relevance of data minimization is particularly pronounced in machine learning (ML) applications.
Existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection.
This work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization.
arXiv Detail & Related papers (2025-08-14T17:00:13Z)
- RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation [73.2390735383842]
We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions.
We show how off-policy evaluation guarantees can be used to derive near-optimal guarantees for an optimistic exploration algorithm.
These results can be valuable for a wide range of interactive learning problems beyond LMDPs, especially for partially observed environments.
arXiv Detail & Related papers (2024-06-03T14:51:27Z)
- Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
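As a sketch under assumed notation (not spelled out in the abstract): given a slate abstraction map $\phi$, a LIPS-style weight replaces the per-slate ratio $\frac{\pi(s \mid x)}{\pi_0(s \mid x)}$ with the marginal ratio $\frac{p(\phi(s) \mid x, \pi)}{p(\phi(s) \mid x, \pi_0)}$, so all slates mapping to the same abstraction share a single weight; this is what tames the variance that plain IPS incurs over a combinatorially large slate space.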
arXiv Detail & Related papers (2024-02-03T14:38:09Z)
- Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits [41.91108406329159]
Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation.
We introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves.
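A minimal sketch under assumed notation: writing $p^{\pi}(y)$ and $p^{\pi_0}(y)$ for the marginal outcome densities under the target and logging policies, an MR-style estimate takes the form $\hat{V}_{\mathrm{MR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{p^{\pi}(y_i)}{p^{\pi_0}(y_i)}\, y_i$, which can be far less variable than the policy ratio whenever many context-action pairs induce similar outcome distributions.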
arXiv Detail & Related papers (2023-12-03T17:04:57Z)
- Learning Action Embeddings for Off-Policy Evaluation [6.385697591955264]
Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy.
But when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance.
Saito and Joachims propose marginalized IPS (MIPS), which uses action embeddings instead of raw actions, reducing the variance of IPS in large action spaces.
arXiv Detail & Related papers (2023-05-06T06:44:30Z)
- A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem.
We show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties.
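One way to read "sampling from the discounted kernel" (our gloss, not a claim from the abstract): draw a horizon $T \sim \mathrm{Geometric}(1-\gamma)$, roll the chain out for $T$ steps, and record $f(S_T)$; since $\mathbb{E}[f(S_T)] = (1-\gamma)\sum_{t \ge 0} \gamma^{t}\, \mathbb{E}[f(S_t)]$, averaging such draws yields an unbiased estimate of the normalized discounted mean without truncating and summing whole trajectories.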
arXiv Detail & Related papers (2023-04-11T09:13:17Z)
- Minimax Weight Learning for Absorbing MDPs [0.276240219662896]
We study undiscounted off-policy policy evaluation for absorbing MDPs.
We propose an algorithm, MWLA, that directly estimates the expected return via the importance ratio of the state-action occupancy measure.
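As a rough sketch under assumed notation: with $d^{\pi}(s,a)$ and $d^{b}(s,a)$ the state-action occupancy measures of the target and behavior policies, an estimator of this family reweights logged rewards by the occupancy ratio, $\hat{V}(\pi) \approx \frac{1}{n}\sum_{i} \frac{d^{\pi}(s_i, a_i)}{d^{b}(s_i, a_i)}\, r_i$, with the ratio itself obtained by minimax weight learning rather than by multiplying per-step propensities.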
arXiv Detail & Related papers (2023-01-09T06:32:11Z)
- Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with finite-sample guarantees on the suboptimality of the learned policy relative to the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO.
Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently.
The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z)