Continuous-Time Multi-Armed Bandits with Controlled Restarts
- URL: http://arxiv.org/abs/2007.00081v1
- Date: Tue, 30 Jun 2020 19:50:39 GMT
- Title: Continuous-Time Multi-Armed Bandits with Controlled Restarts
- Authors: Semih Cayci, Atilla Eryilmaz, R. Srikant
- Abstract summary: We investigate the bandit problem with controlled restarts for time-constrained decision processes.
In particular, we consider a bandit setting where each decision takes a random completion time, and yields a random and correlated reward at the end.
We develop efficient online learning algorithms with $O(\log(\tau))$ and $O(\sqrt{\tau\log(\tau)})$ regret in a finite and continuous action space of restart strategies, respectively.
- Score: 32.63624728528415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time-constrained decision processes have been ubiquitous in many fundamental
applications in physics, biology and computer science. Recently, restart
strategies have gained significant attention for boosting the efficiency of
time-constrained processes by expediting the completion times. In this work, we
investigate the bandit problem with controlled restarts for time-constrained
decision processes, and develop provably good learning algorithms. In
particular, we consider a bandit setting where each decision takes a random
completion time, and yields a random and correlated reward at the end, with
unknown values at the time of decision. The goal of the decision-maker is to
maximize the expected total reward subject to a time constraint $\tau$. As an
additional control, we allow the decision-maker to interrupt an ongoing task
and forgo its reward for a potentially more rewarding alternative. For this
problem, we develop efficient online learning algorithms with $O(\log(\tau))$
and $O(\sqrt{\tau\log(\tau)})$ regret in a finite and continuous action space
of restart strategies, respectively. We demonstrate the applicability of our
algorithm by using it to boost the performance of SAT solvers.
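To make the setting concrete, below is a minimal sketch of a UCB-style learner over (arm, cutoff) pairs: each play pulls an arm but interrupts it at a fixed cutoff, forgoing the reward, and the index targets reward per unit time so that the budget $\tau$ is spent on the most rewarding restart strategy. This is an illustration under assumed interfaces (`arms` as callables returning `(completion_time, reward)` samples), not the paper's algorithm; the confidence bonus is a generic UCB term.

```python
import math
import random

def budgeted_restart_ucb(arms, cutoffs, budget):
    """Hedged sketch: UCB over (arm, cutoff) restart strategies.

    `arms` is a list of callables, each returning a sampled
    (completion_time, reward) pair; `cutoffs` is a finite set of
    interruption times; `budget` plays the role of the time
    constraint tau.  All names are illustrative assumptions.
    """
    actions = [(a, c) for a in range(len(arms)) for c in cutoffs]
    n = {x: 0 for x in actions}        # number of plays per action
    tot_r = {x: 0.0 for x in actions}  # total reward collected per action
    tot_t = {x: 0.0 for x in actions}  # total time spent per action
    elapsed, total_reward, plays = 0.0, 0.0, 0

    while elapsed < budget:            # a real implementation must also
        plays += 1                     # handle the final truncated epoch
        untried = [x for x in actions if n[x] == 0]
        if untried:                    # play every action once first
            arm_i, cutoff = untried[0]
        else:                          # then maximize a reward-rate UCB
            def index(x):
                rate = tot_r[x] / tot_t[x]
                bonus = math.sqrt(2.0 * math.log(plays) / n[x])
                return rate + bonus
            arm_i, cutoff = max(actions, key=index)

        completion_time, reward = arms[arm_i]()
        if completion_time > cutoff:   # interrupt: time is spent, reward forgone
            spent, gained = cutoff, 0.0
        else:
            spent, gained = completion_time, reward

        n[(arm_i, cutoff)] += 1
        tot_r[(arm_i, cutoff)] += gained
        tot_t[(arm_i, cutoff)] += spent
        elapsed += spent
        total_reward += gained
    return total_reward

# Example: heavy-tailed completion times make aggressive restarts attractive.
def slow_arm():
    t = random.paretovariate(1.5)      # heavy-tailed completion time
    return t, 1.0                      # unit reward if allowed to complete

print(budgeted_restart_ucb([slow_arm], cutoffs=[1.5, 5.0, float("inf")], budget=1000.0))
```

Tracking reward and time totals separately is the key design choice here: with a time budget rather than a fixed number of plays, the natural quantity to learn is the reward rate of each restart strategy.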
Related papers
- Exploratory Optimal Stopping: A Singular Control Formulation [2.7309692684728613]
We explore continuous-time and state-space optimal stopping problems from a reinforcement learning perspective.
We introduce a regularized version of the problem by penalizing it with the cumulative residual entropy of the randomized stopping time.
For the specific case of a real option problem, we derive a semiexplicit solution to the regularized problem.
arXiv Detail & Related papers (2024-08-18T02:31:55Z) - Online Linear Programming with Batching [18.989352151219336]
We study Online Linear Programming (OLP) with batching.
We show that the ability to delay decisions brings better operational performance, as measured by regret.
All the proposed algorithms delay decisions on customers arriving in only the first and the last batch.
arXiv Detail & Related papers (2024-08-01T06:13:24Z) - Frugal Algorithm Selection [1.079960007119637]
There is no single algorithm that works well for all instances of a problem.
In this work, we explore reducing this cost by choosing a subset of the training instances on which to train.
We approach this problem in three ways: using active learning to decide based on prediction uncertainty, augmenting the algorithm predictors with a timeout predictor, and collecting training data using a progressively increasing timeout.
arXiv Detail & Related papers (2024-05-17T19:23:30Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Efficient Algorithms for Planning with Participation Constraints [74.74967476995572]
We consider the problem of planning with participation constraints introduced in [Zhang et al., 2022].
In this problem, a principal chooses actions in a decision process, resulting in separate utilities for the principal and the agent.
We provide the first exact algorithm for this problem in finite-horizon settings, where previously only an additive $\varepsilon$-approximation was known.
arXiv Detail & Related papers (2022-05-16T15:47:41Z) - Optimal Admission Control for Multiclass Queues with Time-Varying Arrival Rates via State Abstraction [16.99621896314678]
We consider a novel queuing problem where the decision-maker must choose to accept or reject randomly arriving tasks.
The objective is to decide which tasks to accept so that the total price of tasks processed is maximised over a finite horizon.
We show that the optimal value function has a specific structure, which enables us to solve the hybrid MDP exactly.
arXiv Detail & Related papers (2022-03-14T12:38:13Z) - Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - Learning to Schedule [3.5408022972081685]
This paper proposes a learning and scheduling algorithm to minimize the expected cumulative holding cost incurred by jobs.
In each time slot, the server can process a job while receiving the realized random holding costs of the jobs remaining in the system.
arXiv Detail & Related papers (2021-05-28T08:04:06Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Time-varying Gaussian Process Bandit Optimization with Non-constant Evaluation Time [93.6788993843846]
We propose a novel time-varying Bayesian optimization algorithm that can effectively handle non-constant evaluation times.
Our bound shows that the pattern of the evaluation-time sequence can strongly affect the difficulty of the problem.
arXiv Detail & Related papers (2020-03-10T13:28:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.