A New Bandit Setting Balancing Information from State Evolution and
Corrupted Context
- URL: http://arxiv.org/abs/2011.07989v4
- Date: Sun, 5 Nov 2023 03:25:09 GMT
- Title: A New Bandit Setting Balancing Information from State Evolution and
Corrupted Context
- Authors: Alexander Galozy, Slawomir Nowaczyk, Mattias Ohlsson
- Abstract summary: We propose a new sequential decision-making setting combining key aspects of two established online learning problems with bandit feedback.
The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent.
We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit.
- Score: 52.67844649650687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new sequential decision-making setting, combining key aspects of
two established online learning problems with bandit feedback. The optimal
action to play at any given moment is contingent on an underlying changing
state which is not directly observable by the agent. Each state is associated
with a context distribution, possibly corrupted, allowing the agent to identify
the state. Furthermore, states evolve in a Markovian fashion, providing useful
information to estimate the current state via state history. In the proposed
problem setting, we tackle the challenge of deciding on which of the two
sources of information the agent should base its arm selection. We present an
algorithm that uses a referee to dynamically combine the policies of a
contextual bandit and a multi-armed bandit. We capture the time-correlation of
states through iteratively learning the action-reward transition model,
allowing for efficient exploration of actions. Our setting is motivated by
adaptive mobile health (mHealth) interventions. Users transition through
different, time-correlated, but only partially observable internal states,
determining their current needs. The side information associated with each
internal state might not always be reliable, and standard approaches that rely
solely on the context risk incurring high regret. Similarly, some users might
exhibit weaker correlations between subsequent states, so approaches that rely
solely on state transitions run the same risk. We analyze our setting and
algorithm in terms of regret lower and upper bounds and evaluate our
method on simulated medication adherence intervention data and several
real-world data sets, showing improved empirical performance compared to
several popular algorithms.
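The abstract does not spell out the referee mechanism, so the following Python sketch only illustrates the general idea under stated assumptions: a UCB1 multi-armed bandit, a disjoint LinUCB contextual bandit, and an exponential-weights referee that decides which of the two base policies acts each round. All class names are hypothetical, and the action-reward transition model the paper uses to capture the time-correlation of states is omitted here.
```python
import numpy as np

rng = np.random.default_rng(0)


class UCB1:
    """Context-free multi-armed bandit: UCB1 over per-arm empirical means."""

    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.sums = np.zeros(n_arms)
        self.t = 0

    def select(self, _context):
        self.t += 1
        untried = np.flatnonzero(self.counts == 0)
        if untried.size:
            return int(untried[0])
        means = self.sums / self.counts
        bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(means + bonus))

    def update(self, arm, reward, _context):
        self.counts[arm] += 1
        self.sums[arm] += reward


class LinUCB:
    """Disjoint LinUCB contextual bandit: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        self.alpha = alpha

    def select(self, context):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ context
                          + self.alpha * np.sqrt(context @ A_inv @ context))
        return int(np.argmax(scores))

    def update(self, arm, reward, context):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


class Referee:
    """Exponential-weights referee choosing which base policy acts each round."""

    def __init__(self, policies, eta=0.1):
        self.policies = policies
        self.weights = np.ones(len(policies))
        self.eta = eta

    def play(self, context):
        probs = self.weights / self.weights.sum()
        k = int(rng.choice(len(self.policies), p=probs))
        return k, self.policies[k].select(context), probs

    def update(self, k, arm, reward, context, probs):
        # EXP3-style importance-weighted update for the policy that acted;
        # the observed reward is shared with both base policies (a sketch choice).
        self.weights[k] *= np.exp(self.eta * reward / probs[k])
        for policy in self.policies:
            policy.update(arm, reward, context)


# Smoke test: 3 arms, 5-dimensional (possibly corrupted) context.
referee = Referee([UCB1(3), LinUCB(3, dim=5)])
for _ in range(1000):
    ctx = rng.normal(size=5)
    k, arm, probs = referee.play(ctx)
    reward = float(rng.random() < 0.5)  # stand-in for the unobserved state's reward model
    referee.update(k, arm, reward, ctx, probs)
```
Sharing each observed reward with both base policies is a design choice of this sketch, not something stated in the abstract.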
Related papers
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z) - Online learning in bandits with predicted context [8.257280652461159]
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context.
This setting is motivated by a wide range of applications where the true context for decision-making is unobserved.
We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions.
arXiv Detail & Related papers (2023-07-26T02:33:54Z) - OpenPI-C: A Better Benchmark and Stronger Baseline for Open-Vocabulary
State Tracking [55.62705574507595]
OpenPI is the only dataset annotated for open-vocabulary state tracking.
We categorize three types of problems, at the procedure, step, and state-change levels, respectively.
For evaluation, we propose a cluster-based metric that fixes the original metric's preference for repetition.
arXiv Detail & Related papers (2023-06-01T16:48:20Z) - Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration [97.19464604735802]
A promising technique for exploration is to maximize the entropy of the visited state distribution.
However, it tends to struggle in a supervised setup with a task reward, where the agent prefers to visit high-value states.
We present a novel exploration technique that maximizes the value-conditional state entropy.
arXiv Detail & Related papers (2023-05-31T01:09:28Z) - Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model to take option correlation into account.
In CMCB, a learner sequentially chooses weight vectors over the given options and observes random feedback according to these decisions.
We propose novel algorithms with optimal regret (within logarithmic factors) and provide matching lower bounds to validate their optimality.
arXiv Detail & Related papers (2021-02-24T06:37:05Z) - Blind Decision Making: Reinforcement Learning with Delayed Observations [43.126718159042305]
Reinforcement learning typically assumes that the state update resulting from previous actions is available instantaneously.
When the state update is not available, decisions are made partly in the blind, since they cannot rely on current state information.
This paper proposes an approach in which the delayed knowledge of the state is exploited and decisions are made based on the information that is available.
arXiv Detail & Related papers (2020-11-16T04:29:14Z) - Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with
Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z) - Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling.
We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
arXiv Detail & Related papers (2020-06-15T19:24:02Z)
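For the latent-bandit entry directly above, which only names the UCB and Thompson sampling variants, here is a minimal Thompson-sampling-style sketch in the same spirit: it assumes known Bernoulli arm means per latent state and maintains a belief over the discrete latent state. The function and variable names are illustrative and not taken from that paper.
```python
import numpy as np

rng = np.random.default_rng(1)


def latent_ts_round(belief, reward_means):
    """One round of Thompson sampling over a discrete latent state.

    belief       -- probability vector over latent states, shape (S,)
    reward_means -- known Bernoulli reward means per (state, arm), shape (S, A)
    Returns the chosen arm and a function that produces the updated belief
    from the observed 0/1 reward.
    """
    # Sample a latent state from the current belief, then act greedily for it.
    s = int(rng.choice(len(belief), p=belief))
    arm = int(np.argmax(reward_means[s]))

    def update(reward):
        # Bayes update of the state belief given the Bernoulli observation.
        lik = reward_means[:, arm] if reward == 1 else 1.0 - reward_means[:, arm]
        posterior = belief * lik
        return posterior / posterior.sum()

    return arm, update


# Smoke test: 2 latent states, 3 arms, environment fixed in state 0.
means = np.array([[0.9, 0.2, 0.1],
                  [0.1, 0.2, 0.9]])
belief = np.full(2, 0.5)
for _ in range(200):
    arm, update = latent_ts_round(belief, means)
    reward = float(rng.random() < means[0, arm])
    belief = update(reward)
```
With the means above, the belief typically concentrates on state 0 within a few dozen rounds, after which arm 0 is played almost exclusively.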