Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related
Rewards
- URL: http://arxiv.org/abs/2307.14138v1
- Date: Wed, 26 Jul 2023 12:06:13 GMT
- Title: Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related
Rewards
- Authors: Behzad Nourani-Koliji, Steven Bilaj, Amir Rezaei Balef, Setareh
Maghsudi
- Abstract summary: We study the piecewise-stationary combinatorial semi-bandit problem with causally related rewards.
In our nonstationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process.
The problem is aggravated in the combinatorial semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms.
- Score: 5.347237827669861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the piecewise stationary combinatorial semi-bandit problem with
causally related rewards. In our nonstationary environment, variations in the
base arms' distributions, causal relationships between rewards, or both, change
the reward generation process. In such an environment, an optimal
decision-maker must track both sources of change and adapt accordingly. The
problem is aggravated in the combinatorial semi-bandit setting, where the
decision-maker only observes the outcome of the selected bundle of arms. The
core of our proposed policy is the Upper Confidence Bound (UCB) algorithm. We
assume the agent relies on an adaptive approach to overcome the challenge. More
specifically, it employs a change-point detector based on the Generalized
Likelihood Ratio (GLR) test. In addition, we introduce the notion of a group
restart as an alternative restarting strategy for decision-making in
structured environments. Finally, our algorithm integrates a mechanism to
track variations of the underlying graph structure, which captures the causal
relationships between the rewards in the bandit setting. Theoretically, we
establish a regret upper bound that reflects the effects of the number of
structural and distributional changes on performance. Our numerical
experiments on real-world scenarios demonstrate the applicability of our
proposal and its superior performance compared to state-of-the-art benchmarks.
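The abstract names the moving parts (a UCB core, a GLR change-point detector, restarts) without spelling them out, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes Bernoulli base-arm rewards and single-arm play instead of the paper's combinatorial bundle selection; the detection threshold is a heuristic, the global restart stands in for the paper's group restart (which would reset only the arms affected by a detected change), and the causal graph over rewards is omitted entirely. For context, rewards in this line of work are typically coupled through the weighted adjacency matrix A of a causal graph, e.g. a structural model along the lines of r_t = (I - A)^{-1} b_t for intrinsic payoffs b_t, so a change point may move the distribution of b_t, the matrix A, or both.

```python
import math

def kl_bernoulli(p, q, eps=1e-9):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_change_detected(x, threshold):
    """Bernoulli GLR test: compare the best two-segment fit of x against the
    single-segment fit; flag a change when the statistic exceeds threshold."""
    n = len(x)
    if n < 2:
        return False
    total = sum(x)
    mu = total / n
    head = 0.0
    for s in range(1, n):  # candidate change point after sample s
        head += x[s - 1]
        stat = (s * kl_bernoulli(head / s, mu)
                + (n - s) * kl_bernoulli((total - head) / (n - s), mu))
        if stat >= threshold:
            return True
    return False

class RestartUCB:
    """UCB over base arms with GLR-triggered restarts (illustrative sketch;
    rewards are assumed to lie in {0, 1})."""

    def __init__(self, n_arms, delta=0.05):
        self.n_arms = n_arms
        self.delta = delta  # tolerated false-alarm rate for the detector
        self.restart()

    def restart(self):
        self.obs = [[] for _ in range(self.n_arms)]
        self.t = 0

    def select_arm(self):
        self.t += 1
        for a in range(self.n_arms):
            if not self.obs[a]:
                return a  # play every arm once after a restart
        def ucb(a):
            n_a = len(self.obs[a])
            return sum(self.obs[a]) / n_a + math.sqrt(2 * math.log(self.t) / n_a)
        return max(range(self.n_arms), key=ucb)

    def update(self, arm, reward):
        self.obs[arm].append(reward)
        n = len(self.obs[arm])
        # Heuristic threshold growing like log(n); the paper's exact threshold
        # and its group-restart rule are not reproduced here.
        if glr_change_detected(self.obs[arm], math.log((n ** 1.5) / self.delta)):
            self.restart()  # a group restart would reset only related arms
```

Usage would look like the loop below, with `pull` standing in for a hypothetical environment call that returns a 0/1 reward:

```python
policy = RestartUCB(n_arms=5)
for _ in range(10_000):
    a = policy.select_arm()
    r = pull(a)  # hypothetical environment interaction
    policy.update(a, r)
```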
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on the equivalence of common variance-reduction techniques in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z) - Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential decision-making framework where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related
Rewards [7.0997346625024]
We formalize a non-stationary and delayed semi-bandit problem with causally related rewards.
We develop a policy that learns the structural dependencies from delayed feedback and utilizes that to optimize the decision-making.
We evaluate our method via numerical analysis using synthetic and real-world datasets to detect the regions that contribute the most to the spread of Covid-19 in Italy.
arXiv Detail & Related papers (2023-07-18T09:22:33Z) - Federated Learning for Heterogeneous Bandits with Unobserved Contexts [0.0]
We study the problem of federated multi-armed contextual bandits with unknown contexts.
We propose an elimination-based algorithm and prove the regret bound for linearly parametrized reward functions.
arXiv Detail & Related papers (2023-03-29T22:06:24Z) - Linear Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We propose a policy that determines the causal relations by learning the network's topology.
We establish a sublinear regret bound for the proposed algorithm.
arXiv Detail & Related papers (2022-12-25T16:05:21Z) - Contextual bandits with concave rewards, and an application to fair
ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - On the Complexity of Adversarial Decision Making [101.14158787665252]
We show that the Decision-Estimation Coefficient is necessary and sufficient to obtain low regret for adversarial decision making.
We provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures.
arXiv Detail & Related papers (2022-06-27T06:20:37Z) - Break your Bandit Routine with LSD Rewards: a Last Switch Dependent
Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We propose an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z) - Deconfounding Scores: Feature Representations for Causal Effect
Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z) - Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model to take into account option correlation.
In CMCB, there is a learner who sequentially chooses weight vectors on given options and observes random feedback according to the decisions.
We propose novel algorithms with optimal regrets (within logarithmic factors) and provide matching lower bounds to validate their optimality.
arXiv Detail & Related papers (2021-02-24T06:37:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.