Bandit Learning with Delayed Impact of Actions
- URL: http://arxiv.org/abs/2002.10316v4
- Date: Sun, 31 Oct 2021 17:56:24 GMT
- Title: Bandit Learning with Delayed Impact of Actions
- Authors: Wei Tang, Chien-Ju Ho, Yang Liu
- Abstract summary: We consider a multi-armed bandit (MAB) problem with delayed impact of actions.
In our setting, actions taken in the past impact the arm rewards in the subsequent future.
This delayed impact of actions is prevalent in the real world.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a stochastic multi-armed bandit (MAB) problem with delayed impact
of actions. In our setting, actions taken in the past impact the arm rewards in
the subsequent future. This delayed impact of actions is prevalent in the real
world. For example, the capability of people in a certain social group to pay
back a loan might depend on how frequently loan applications from that group
have historically been approved. If banks keep rejecting loan applications
from people in a disadvantaged group, it could create a feedback loop that
further damages that group's chances of getting loans. In this paper, we
formulate this delayed and long-term impact of actions within the context of
multi-armed bandits. We generalize the bandit setting to encode how this
"bias" induced by the action history affects arm rewards during learning. The goal is to
maximize the collected utilities over time while taking into account the
dynamics created by the delayed impacts of historical actions. We propose an
algorithm that achieves a regret of $\tilde{\mathcal{O}}(KT^{2/3})$ and show a
matching regret lower bound of $\Omega(KT^{2/3})$, where $K$ is the number of
arms and $T$ is the learning horizon. Our results complement the bandit
literature by adding techniques to deal with actions with long-term impacts and
have implications in designing fair algorithms.
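To make the setting concrete, here is a minimal simulation sketch. The linear dependence of an arm's mean reward on its historical pull frequency, and the explore-then-commit learner, are illustrative assumptions of mine, not the paper's exact model or algorithm; explore-then-commit is simply the style of approach consistent with the $T^{2/3}$ rate above.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 30_000
base = np.array([0.3, 0.5, 0.4])    # baseline mean rewards (assumed values)
impact = np.array([0.4, 0.1, 0.2])  # strength of the delayed impact (assumed)

def mean_reward(arm, pulls, t):
    # Illustrative dynamic: an arm's mean grows with the historical
    # fraction of rounds in which it was chosen (a feedback loop).
    freq = pulls[arm] / max(t, 1)
    return min(base[arm] + impact[arm] * freq, 1.0)

# Explore-then-commit: uniform exploration for O(T^{2/3}) rounds,
# then commit to the empirically best arm.
explore_end = K * int(T ** (2 / 3))
pulls = np.zeros(K)
rewards = np.zeros(K)

for t in range(T):
    if t < explore_end:
        arm = t % K  # round-robin exploration
    else:
        arm = int(np.argmax(rewards / np.maximum(pulls, 1)))  # commit
    r = rng.binomial(1, mean_reward(arm, pulls, t))
    pulls[arm] += 1
    rewards[arm] += r

print(f"average reward: {rewards.sum() / T:.3f}")
```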
Related papers
- Learning for Bandits under Action Erasures
We consider a novel multi-armed bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels.
In our model, while the distributed agents know if an action is erased, the central learner does not.
We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures.
arXiv Detail & Related papers (2024-06-26T05:03:00Z)
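A toy sketch of that erasure setting, under assumptions of mine: a repetition-style workaround with an epsilon-greedy stand-in for the base algorithm. This is not necessarily the paper's scheme; it mainly illustrates why the learner's ignorance of erasures biases the updates.

```python
import random

random.seed(0)

K, T = 4, 5000
means = [0.2, 0.5, 0.7, 0.4]
P_ERASE, REPEATS = 0.3, 3

counts, sums = [0] * K, [0.0] * K  # epsilon-greedy stand-in for any base MAB algorithm

def select():
    if random.random() < 0.1 or 0 in counts:
        return random.randrange(K)
    return max(range(K), key=lambda a: sums[a] / counts[a])

last_received = 0  # the agent's fallback action when a send is erased
for _ in range(T):
    arm = select()
    # Repetition over the channel: the action survives with prob 1 - P_ERASE**REPEATS.
    delivered = any(random.random() > P_ERASE for _ in range(REPEATS))
    played = arm if delivered else last_received  # the agent knows about erasures...
    if delivered:
        last_received = arm
    reward = float(random.random() < means[played])
    # ...but the central learner does not, so crediting `arm` with a reward
    # that `played` produced is exactly the bias a robust scheme must handle.
    counts[arm] += 1
    sums[arm] += reward
```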
- Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback
Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision making problems.
In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards.
We develop a novel reinforcement learning algorithm with two key components.
arXiv Detail & Related papers (2024-05-02T02:20:19Z)
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback
Learning from human feedback plays an important role in aligning generative models such as large language models (LLMs).
We study a model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose a robust contextual dueling bandit algorithm based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
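A rough sketch of that uncertainty-weighted MLE idea on simulated preference data; the weighting rule and the 10% corruption rate below are my assumptions, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 400
theta_true = rng.normal(size=d)

# Simulated duels: z = phi(x, a) - phi(x, b); label ~ Bernoulli(sigmoid(z @ theta)),
# then an adversary flips ~10% of the labels.
Z = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-Z @ theta_true))).astype(float)
flip = rng.random(n) < 0.1
y[flip] = 1 - y[flip]

# Uncertainty weighting (illustrative): down-weight duels in poorly covered
# directions, where a flipped label can move the estimate the most.
V = np.eye(d) + Z.T @ Z / n
w = 1 / (1 + np.sqrt(np.einsum("ij,jk,ik->i", Z, np.linalg.inv(V), Z)))

# Weighted logistic MLE by gradient ascent with a small ridge term.
theta = np.zeros(d)
for _ in range(500):
    q = 1 / (1 + np.exp(-Z @ theta))
    theta += 0.5 * (Z.T @ (w * (y - q)) / n - 0.01 * theta)

cos = theta @ theta_true / (np.linalg.norm(theta) * np.linalg.norm(theta_true))
print(f"alignment with true preference vector: {cos:.3f}")
```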
- Adversarial Bandits with Multi-User Delayed Feedback: Theory and Application
In this paper, we consider delayed feedback that comes from multiple users, with no restriction on its internal distribution.
We formulate an adversarial MAB problem with multi-user delayed feedback and design a modified EXP3 algorithm, MUD-EXP3.
arXiv Detail & Related papers (2023-10-17T12:08:15Z)
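A sketch of the delayed-feedback mechanics on top of vanilla EXP3, with importance-weighted updates applied only when feedback arrives. The delay model and step size are assumptions; MUD-EXP3 itself uses a more careful construction.

```python
import math
import random

random.seed(2)

K, T, ETA = 3, 10_000, 0.01
means = [0.6, 0.4, 0.5]
log_w = [0.0] * K
pending = []  # (arrival_round, arm, loss, prob_when_played) from multiple users

for t in range(T):
    # Sample from the exponential-weights distribution (log-space for stability).
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    s = sum(w)
    probs = [x / s for x in w]
    arm = random.choices(range(K), weights=probs)[0]

    # Each user's feedback returns after its own random delay (assumed <= 50 rounds).
    loss = float(random.random() > means[arm])
    pending.append((t + random.randint(1, 50), arm, loss, probs[arm]))

    # Apply whatever has arrived by now, importance-weighted by the
    # probability the arm had at play time.
    arrived = [f for f in pending if f[0] <= t]
    pending = [f for f in pending if f[0] > t]
    for _, a, l, p in arrived:
        log_w[a] -= ETA * l / p  # EXP3 loss update

print("arm favored by the learner:", max(range(K), key=lambda a: log_w[a]))
```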
- Weighted Tallying Bandits: Overcoming Intractability via Repeated Exposure Optimality
In recommender system or crowdsourcing applications, a human's preferences or abilities are often a function of the algorithm's recent actions.
We introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of a summation of the number of times that arm was played in the last $m$ timesteps.
We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has regret $O\left(\sqrt{KT} + m + \cdots\right)$.
arXiv Detail & Related papers (2023-05-04T15:59:43Z)
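A small sketch of the WTB loss structure under an assumed fatigue-style loss; the paper only requires the loss to be some function of the $m$-step tally, and its algorithm modifies successive elimination rather than playing at random.

```python
from collections import deque
import random

random.seed(3)

K, T, M = 3, 5000, 20
window = deque(maxlen=M)  # arms played in the last M timesteps

def loss(arm):
    # Illustrative choice: an arm degrades with the number of times it was
    # played recently (e.g. user fatigue); WTB allows any function of this tally.
    tally = sum(1 for a in window if a == arm)
    return min(1.0, 0.2 + 0.03 * tally)

total = 0.0
for _ in range(T):
    arm = random.randrange(K)  # placeholder policy, for illustration only
    total += loss(arm)
    window.append(arm)

print(f"average loss: {total / T:.3f}")
```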
- Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts
We study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z)
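A sketch of that temporally-partitioned feedback model with a plain UCB learner; the even split of each reward over TAU rounds and this confidence bound are my assumptions, whereas TP-UCB-FR and TP-UCB-EW exploit the partial observations with sharper bounds.

```python
import math
import random

random.seed(4)

K, T, TAU = 3, 5000, 10
means = [0.3, 0.6, 0.5]
arrivals = [[] for _ in range(T + TAU)]  # (arm, partial_reward) pieces due each round
est_sum = [0.0] * K  # reward mass observed so far, per arm
pulls = [0] * K

for t in range(T):
    # Plain UCB on the partially observed reward mass.
    if 0 in pulls:
        arm = pulls.index(0)
    else:
        arm = max(range(K), key=lambda a: est_sum[a] / pulls[a]
                  + math.sqrt(2 * math.log(t + 1) / pulls[a]))
    pulls[arm] += 1

    # TP-MAB feedback: the pull's reward is split into TAU pieces that
    # arrive over the following rounds instead of all at once.
    reward = float(random.random() < means[arm])
    for k in range(TAU):
        arrivals[t + k].append((arm, reward / TAU))

    for a, piece in arrivals[t]:  # collect the pieces that are due now
        est_sum[a] += piece

print("pull counts:", pulls)
```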
- Addressing the Long-term Impact of ML Decisions via Policy Regret
We study a setting in which the reward from each arm evolves every time the decision-maker pulls that arm.
We argue that an acceptable sequential allocation of opportunities must take an arm's potential for growth into account.
We present an algorithm with provably sub-linear policy regret for sufficiently long time horizons.
arXiv Detail & Related papers (2021-06-02T17:38:10Z)
- Online Markov Decision Processes with Aggregate Bandit Feedback
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
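A toy illustration of aggregate bandit feedback under assumed dynamics: only the trajectory's total loss is revealed, and even a naive learner can compare fixed policies by those totals; the paper's algorithm is far more refined.

```python
import random

random.seed(5)

H, A = 5, 2  # horizon and number of actions per step

def step_loss(s, a):
    # Hidden per-step loss; the learner never sees these individually.
    return random.random() * (0.3 if a == s % A else 0.7)

def run_episode(policy):
    s, total = 0, 0.0
    for h in range(H):
        total += step_loss(s, policy[h])
        s = (s + policy[h] + 1) % 3  # toy unknown dynamics
    return total  # aggregate bandit feedback: the only thing revealed

# Naive learner: estimate each candidate policy by its average aggregate loss.
policies = [[random.randrange(A) for _ in range(H)] for _ in range(8)]
scores = [sum(run_episode(p) for _ in range(200)) / 200 for p in policies]
print("best fixed policy found:", policies[scores.index(min(scores))])
```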
- Learning by Repetition: Stochastic Multi-armed Bandits under Priming Effect
We study the effect of persistence of engagement on learning in a multi-armed bandit setting.
We provide novel algorithms that achieve regret sublinear in time and in the relevant wear-in/wear-out parameters.
arXiv Detail & Related papers (2020-06-18T08:27:23Z)
- Stochastic Bandits with Linear Constraints
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
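A sketch of that optimism/pessimism split with linear reward and cost signals; the features, noise, and confidence radius are assumed, and OPLB's actual construction differs in how it combines the two bounds.

```python
import numpy as np

rng = np.random.default_rng(6)
d, K, T, TAU = 4, 6, 3000, 0.5  # TAU: cost budget per round (assumed)
theta_r = rng.normal(size=d)    # unknown reward parameter
theta_c = rng.normal(size=d)    # unknown cost parameter
X = rng.normal(size=(K, d)) / np.sqrt(d)  # fixed arm features, for simplicity

V = np.eye(d)
b_r, b_c = np.zeros(d), np.zeros(d)
for t in range(T):
    Vinv = np.linalg.inv(V)
    hat_r, hat_c = Vinv @ b_r, Vinv @ b_c
    width = np.sqrt(np.einsum("ij,jk,ik->i", X, Vinv, X))
    reward_ucb = X @ hat_r + 0.5 * width  # optimism for the reward
    cost_ucb = X @ hat_c + 0.5 * width    # pessimism for the constraint
    feasible = cost_ucb <= TAU            # only arms that look safe pessimistically
    if not feasible.any():
        feasible[:] = True                # crude fallback; a real algorithm is more careful
    arm = int(np.argmax(np.where(feasible, reward_ucb, -np.inf)))
    x = X[arm]
    V += np.outer(x, x)
    b_r += x * (x @ theta_r + 0.1 * rng.normal())
    b_c += x * (x @ theta_c + 0.1 * rng.normal())

print("final reward estimate per arm:", np.round(X @ (np.linalg.inv(V) @ b_r), 2))
```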
- Adversarial Attacks on Linear Contextual Bandits
Malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior.
We show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm $T - o(T)$ times over a horizon of $T$ steps.
We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context.
arXiv Detail & Related papers (2020-02-10T15:04:09Z)
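A minimal sketch of the attack intuition, using a plain UCB stand-in rather than the linear contextual algorithms the paper targets: by corrupting the observed rewards of non-target arms, the attacker makes the learner pull an arbitrary target arm almost always.

```python
import math
import random

random.seed(7)

K, T, TARGET = 4, 5000, 2
means = [0.9, 0.8, 0.1, 0.7]  # the target arm is actually the worst
counts, sums = [0] * K, [0.0] * K

def ucb_pick(t):
    for a in range(K):
        if counts[a] == 0:
            return a
    return max(range(K), key=lambda a: sums[a] / counts[a]
               + math.sqrt(2 * math.log(t + 1) / counts[a]))

target_pulls = 0
for t in range(T):
    arm = ucb_pick(t)
    reward = float(random.random() < means[arm])
    # Attack: zero out the observed reward of every non-target arm, so the
    # target looks best; non-target arms are pulled only o(T) times, which
    # also bounds the attacker's total corruption budget.
    observed = reward if arm == TARGET else 0.0
    counts[arm] += 1
    sums[arm] += observed
    target_pulls += int(arm == TARGET)

print(f"target arm pulled {target_pulls}/{T} times")
```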