Bandit Linear Optimization for Sequential Decision Making and
Extensive-Form Games
- URL: http://arxiv.org/abs/2103.04546v1
- Date: Mon, 8 Mar 2021 05:00:13 GMT
- Title: Bandit Linear Optimization for Sequential Decision Making and
Extensive-Form Games
- Authors: Gabriele Farina, Robin Schmucker, Tuomas Sandholm
- Abstract summary: Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment.
It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history.
In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time iterations (in the size of the decision tree) and (ii) $O(\sqrt{T})$ cumulative regret in expectation compared to any fixed strategy.
- Score: 102.23975166536326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tree-form sequential decision making (TFSDM) extends classical one-shot
decision making by modeling tree-form interactions between an agent and a
potentially adversarial environment. It captures the online decision-making
problems that each player faces in an extensive-form game, as well as Markov
decision processes and partially-observable Markov decision processes where the
agent conditions on observed history. Over the past decade, there has been
considerable effort into designing online optimization methods for TFSDM.
Virtually all of that work has been in the full-feedback setting, where the
agent has access to counterfactuals, that is, information on what would have
happened had the agent chosen a different action at any decision node. Little
is known about the bandit setting, where that assumption is reversed (no
counterfactual information is available), despite this latter setting being
well understood for almost 20 years in one-shot decision making. In this paper,
we give the first algorithm for the bandit linear optimization problem for
TFSDM that offers both (i) linear-time iterations (in the size of the decision
tree) and (ii) $O(\sqrt{T})$ cumulative regret in expectation compared to any
fixed strategy, at all times $T$. This is made possible by new results that we
derive, which may have independent uses as well: 1) geometry of the dilated
entropy regularizer, 2) autocorrelation matrix of the natural sampling scheme
for sequence-form strategies, 3) construction of an unbiased estimator for
linear losses for sequence-form strategies, and 4) a refined regret analysis
for mirror descent when using the dilated entropy regularizer.
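The abstract's ingredients (an unbiased estimator of the linear loss built from a single sampled action, and mirror descent with an entropy-style regularizer) are easiest to see in the one-shot simplex setting that TFSDM generalizes. The sketch below is only that simplified analogue (an Exp3-style update with the plain entropy regularizer), not the paper's sequence-form algorithm with the dilated entropy regularizer; the function name and the toy loss are invented for illustration.

```python
# Minimal sketch: bandit linear optimization over the probability simplex via
# online mirror descent with the (negative-)entropy regularizer. The paper's
# algorithm works over sequence-form strategy polytopes with the *dilated*
# entropy regularizer; this one-shot version only illustrates (1) the
# importance-weighted unbiased loss estimator and (2) the mirror-descent step.
import numpy as np

def bandit_omd_simplex(loss_fn, n_actions, T, eta=0.1, rng=None):
    """Run T rounds of bandit linear optimization on the simplex.

    loss_fn(t) returns the hidden loss vector of round t with entries in
    [0, 1]; only the coordinate of the sampled action is revealed.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.full(n_actions, 1.0 / n_actions)   # current mixed strategy
    cum_loss = 0.0
    for t in range(T):
        losses = loss_fn(t)                   # hidden from the learner
        a = rng.choice(n_actions, p=x)        # sample an action a ~ x
        observed = losses[a]                  # bandit feedback: one scalar
        cum_loss += observed

        # Importance-weighted unbiased estimator of the full loss vector:
        # E[loss_hat] = losses, since coordinate a is drawn with probability x[a].
        loss_hat = np.zeros(n_actions)
        loss_hat[a] = observed / x[a]

        # Mirror-descent step with the entropy regularizer = multiplicative update.
        x = x * np.exp(-eta * loss_hat)
        x /= x.sum()
    return cum_loss, x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fixed = rng.uniform(size=5)               # a fixed (toy) loss vector
    cum, x = bandit_omd_simplex(lambda t: fixed, n_actions=5, T=5000)
    print("average loss:", cum / 5000, "best fixed action loss:", fixed.min())
```

In the tree-form setting, the paper swaps the simplex for the sequence-form polytope, the single sampled action for the natural sampling scheme over sequence-form strategies mentioned in the abstract, and the entropy regularizer for its dilated counterpart.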
Related papers
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Constrained Online Two-stage Stochastic Optimization: Near Optimal Algorithms via Adversarial Learning [1.994307489466967]
We consider an online two-stage optimization with long-term constraints over a finite horizon of $T$ periods.
We develop online algorithms for the online two-stage problem from adversarial learning algorithms.
arXiv Detail & Related papers (2023-02-02T10:33:09Z)
- Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow [73.1896399783641]
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe that the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
arXiv Detail & Related papers (2021-07-01T03:54:36Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
- Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
- An Efficient Algorithm for Cooperative Semi-Bandits [0.0]
We introduce Coop-FTPL, a cooperative version of the well-known Follow The Perturbed Leader algorithm.
We show that the expected regret of our algorithm after $T$ time steps is of order $\sqrt{QT\log(k)}\,(k\alpha_1/Q + m)$, where $Q$ is the total activation probability mass.
arXiv Detail & Related papers (2020-10-05T07:08:26Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
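The last related paper refers to a primal-dual scheme whose analysis follows the Lagrange multiplier process. As a loose, hypothetical illustration of that general idea only (not the upper confidence primal-dual algorithm itself, which handles unknown transitions and adversarial losses), the sketch below alternates an exponentiated-gradient step on a distribution over actions with a projected gradient step on a single Lagrange multiplier for a toy constrained problem; all names and numbers are invented.

```python
# Hypothetical toy illustration of Lagrangian primal-dual updates: maximize
# expected reward over a distribution x on actions subject to an expected-cost
# budget. Not the cited paper's method; it only shows how a Lagrange
# multiplier "prices" the constraint while the primal iterate follows
# exponentiated-gradient (mirror ascent) steps.
import numpy as np

def primal_dual_constrained_bandit(reward, cost, budget, T=2000,
                                   eta_x=0.05, eta_lam=0.05):
    n = len(reward)
    x = np.full(n, 1.0 / n)      # primal variable: distribution over actions
    lam = 0.0                    # dual variable: Lagrange multiplier
    avg_x = np.zeros(n)
    for _ in range(T):
        # Primal step: mirror ascent on the Lagrangian reward - lam * cost.
        grad = reward - lam * cost
        x = x * np.exp(eta_x * grad)
        x /= x.sum()
        # Dual step: projected gradient ascent on the multiplier
        # (raise the price when expected cost exceeds the budget, relax otherwise).
        lam = max(0.0, lam + eta_lam * (x @ cost - budget))
        avg_x += x
    return avg_x / T, lam

if __name__ == "__main__":
    reward = np.array([1.0, 0.8, 0.2])
    cost = np.array([1.0, 0.3, 0.0])
    x_bar, lam = primal_dual_constrained_bandit(reward, cost, budget=0.4)
    print("policy:", np.round(x_bar, 3),
          "expected cost:", round(float(x_bar @ cost), 3),
          "multiplier:", round(lam, 3))
```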