Equilibrium Bandits: Learning Optimal Equilibria of Unknown Dynamics
- URL: http://arxiv.org/abs/2302.13653v1
- Date: Mon, 27 Feb 2023 10:47:15 GMT
- Title: Equilibrium Bandits: Learning Optimal Equilibria of Unknown Dynamics
- Authors: Siddharth Chandak, Ilai Bistritz, Nicholas Bambos
- Abstract summary: Consider a decision-maker that can pick one out of $K$ actions to control an unknown system, for $T$ turns.
The dynamics of the system are unknown to the decision-maker, which can only observe a noisy reward at the end of every turn.
Existing bandit algorithms, either stochastic or adversarial, achieve linear (trivial) regret for this problem.
We present a novel algorithm that knows when to quickly switch away from an action if it is not worth waiting until the equilibrium is reached.
- Score: 23.722837647516357
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Consider a decision-maker that can pick one out of $K$ actions to control an
unknown system, for $T$ turns. The actions are interpreted as different
configurations or policies. Holding the same action fixed, the system
asymptotically converges to a unique equilibrium, as a function of this action.
The dynamics of the system are unknown to the decision-maker, which can only
observe a noisy reward at the end of every turn. The decision-maker wants to
maximize its accumulated reward over the $T$ turns. Learning what equilibria
are better results in higher rewards, but waiting for the system to converge to
equilibrium costs valuable time. Existing bandit algorithms, either stochastic
or adversarial, achieve linear (trivial) regret for this problem. We present a
novel algorithm, termed Upper Equilibrium Concentration Bound (UECB), that
knows when to switch actions quickly if it is not worth waiting until the
equilibrium is reached. This is enabled by employing convergence bounds to
determine how far the system is from equilibrium. We prove that UECB achieves a
regret of $\mathcal{O}(\log(T)+\tau_c\log(\tau_c)+\tau_c\log\log(T))$ for this
equilibrium bandit problem where $\tau_c$ is the worst case approximate
convergence time to equilibrium. We then show that both epidemic control and
game control are special cases of equilibrium bandits, where $\tau_c\log
\tau_c$ typically dominates the regret. We then test UECB numerically for both
of these applications.
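The abstract only sketches the key idea (convergence bounds decide whether an action is worth holding), so the following is a minimal toy illustration of the setting in Python. The geometric dynamics, the known contraction rate `gamma`, and the naive hold-then-commit rule are all assumptions made for illustration; this is not the UECB algorithm itself.

```python
# Toy sketch of the equilibrium-bandit setting; all specifics here (geometric
# dynamics, known contraction rate, hold-then-commit rule) are illustrative
# assumptions, not the UECB algorithm from the paper.
import math
import numpy as np

rng = np.random.default_rng(0)

K, T = 5, 5000
equilibria = rng.uniform(0.0, 1.0, size=K)  # unknown per-action equilibrium rewards
gamma, sigma, eps = 0.9, 0.1, 0.01          # contraction rate, noise level, accuracy

state = 0.0                                  # scalar system state (= expected reward)

def play(action):
    """Hold `action` for one turn: the state contracts toward that action's
    equilibrium, and only a noisy reward is observed."""
    global state
    state = gamma * state + (1 - gamma) * equilibria[action]
    return state + sigma * rng.standard_normal()

# Convergence bound: |state - equilibrium| <= gamma**n after n holds (all
# values lie in [0, 1]), so n_hold turns suffice to get within eps.
n_hold = math.ceil(math.log(1 / eps) / math.log(1 / gamma))

# Explore: hold each action long enough to (approximately) equilibrate, then
# average a few settled rewards. Every switch restarts the transient, which
# is exactly why switching too eagerly costs valuable time in this setting.
t, estimates = 0, np.zeros(K)
for a in range(K):
    rewards = [play(a) for _ in range(n_hold + 20)]
    t += n_hold + 20
    estimates[a] = np.mean(rewards[n_hold:])  # trust only post-convergence samples

best = int(np.argmax(estimates))              # commit to the best-looking action
total = sum(play(best) for _ in range(T - t))
print(f"committed to action {best}; true best is {int(np.argmax(equilibria))}")
```

Note how every switch restarts the transient toward the new action's equilibrium; balancing that convergence cost against exploration is the tension the paper's algorithm is designed to manage.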
Related papers
- Computational Lower Bounds for Regret Minimization in Normal-Form Games [68.66209476382213]
We provide evidence that existing learning algorithms, such as multiplicative weights update, are close to optimal.
Our results are obtained in the algorithmic framework put forward by Kothari and Mehta.
arXiv Detail & Related papers (2024-11-04T00:39:52Z)
- Games played by Exponential Weights Algorithms [0.0]
We consider a repeated interaction in discrete time, where each player uses an exponential weights algorithm characterized by an initial mixed action and a fixed learning rate.
We show that whenever a strict Nash equilibrium exists, the probability to play a strict Nash equilibrium at the next stage converges almost surely to 0 or 1.
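To make the setup concrete, here is one exponential-weights step per player in a small common-interest game; the payoff matrix, the fixed learning rate, and the expected-payoff feedback (the paper's players act on realized random actions) are assumptions of this sketch.

```python
# Illustrative exponential-weights dynamics; not the paper's exact setting,
# which uses realized random actions rather than expected payoffs.
import numpy as np

def exp_weights(x, payoffs, eta):
    """Reweight each pure action by exp(eta * payoff), then renormalize."""
    w = x * np.exp(eta * payoffs)
    return w / w.sum()

A = np.array([[2.0, 0.0], [0.0, 1.0]])  # common-interest game: both players get
B = A                                    # A[i, j]; strict NEs at (0, 0) and (1, 1)
x = np.array([0.5, 0.5])                 # initial mixed actions
y = np.array([0.5, 0.5])
eta = 0.1                                # fixed learning rate
for _ in range(300):
    x, y = exp_weights(x, A @ y, eta), exp_weights(y, x @ B, eta)
print(x.round(3), y.round(3))            # both mixes concentrate on the strict NE (0, 0)
```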
arXiv Detail & Related papers (2024-07-09T08:49:51Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the total variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- On Tractable $\Phi$-Equilibria in Non-Concave Games [53.212133025684224]
Non-concave games introduce significant game-theoretic and optimization challenges.
We show that when $\Phi$ is finite, there exists an efficient uncoupled learning algorithm that converges to the corresponding $\Phi$-equilibria.
We also show that Online Gradient Descent can efficiently approximate $\Phi$-equilibria in non-trivial regimes.
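For orientation, here is a generic projected online gradient step over the probability simplex; the placeholder gradient and the step-size schedule are assumptions, not the construction analyzed in the paper.

```python
# Generic projected online gradient ascent on a mixed strategy; illustrative.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ogd_step(x, grad, eta):
    """One online step: move along the utility gradient, project back."""
    return project_simplex(x + eta * grad)

x = np.ones(3) / 3                        # start from the uniform mixed strategy
for t in range(1, 101):
    grad = np.array([1.0, 0.5, -0.2])     # stand-in for the utility gradient at x
    x = ogd_step(x, grad, eta=1.0 / np.sqrt(t))
print(x.round(3))                         # mass migrates toward the best action
```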
arXiv Detail & Related papers (2024-03-13T01:51:30Z)
- A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning [53.83345471268163]
We investigate learning the equilibria in non-stationary multi-agent systems.
We show how to test for various types of equilibria by a black-box reduction to single-agent learning.
arXiv Detail & Related papers (2023-06-12T23:48:24Z)
- Local Convergence of Gradient Methods for Min-Max Games: Partial Curvature Generically Suffices [0.5439020425819]
We study the convergence to local Nash equilibria of gradient methods for two-player zero-sum differentiable games.
We show that thanks to partial curvature, conic particle methods -- which optimize over both weights and supports of the mixed strategies -- generically converge faster than fixed-support methods.
arXiv Detail & Related papers (2023-05-26T21:43:56Z)
- Bayes correlated equilibria and no-regret dynamics [9.89901717499058]
This paper explores equilibrium concepts for Bayesian games, which are fundamental models of games with incomplete information.
We focus on communication equilibria, which can be realized by a mediator who gathers each player's private information and then sends correlated recommendations to the players.
We present an efficient algorithm for minimizing untruthful swap regret with a sublinear upper bound, which we prove to be tight up to a multiplicative constant.
arXiv Detail & Related papers (2023-04-11T06:22:51Z)
- How Bad is Selfish Driving? Bounding the Inefficiency of Equilibria in Urban Driving Games [64.71476526716668]
We study the (in)efficiency of any equilibrium players might agree to play.
We obtain guarantees that refine existing bounds on the Price of Anarchy.
Although the obtained guarantees concern open-loop trajectories, we observe efficient equilibria even when agents employ closed-loop policies.
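For reference, with $C(s)$ denoting the social cost of a joint outcome $s$, the Price of Anarchy is the worst-case ratio $\mathrm{PoA} = \max_{s \in \mathrm{Eq}} C(s) \,/\, \min_{s} C(s) \ge 1$; the guarantees summarized above tighten upper bounds on this ratio for the driving games considered.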
arXiv Detail & Related papers (2022-10-24T09:32:40Z)
- Safe Equilibrium [1.7132914341329848]
The standard game-theoretic solution concept, Nash equilibrium, assumes that all players behave rationally.
We propose a new solution concept called safe equilibrium that models opponents as behaving rationally with a specified probability.
We prove that a safe equilibrium exists in all strategic-form games, and prove that its computation is PPAD-hard.
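To make the opponent model concrete, here is a toy evaluation in a 2x2 game of a strategy against an opponent who best-responds with probability $p$ and plays adversarially otherwise; the payoffs, tie-breaking, and grid search below are assumptions of this sketch, not the paper's computation.

```python
# Toy "rational with probability p" opponent model; illustrative only.
import numpy as np

A = np.array([[3.0, 0.0], [1.0, 2.0]])  # player 1's payoffs (hypothetical game)
B = np.array([[2.0, 1.0], [0.0, 3.0]])  # player 2's payoffs
p = 0.8                                  # probability the opponent acts rationally

def value(x):
    """Player 1's payoff for mixed strategy x: with prob p the opponent
    best-responds to x using B; with prob 1 - p it plays the worst case."""
    u = x @ A                            # player 1's payoff per opponent action
    v = x @ B                            # opponent's own payoff per action
    rational = u[np.argmax(v)]           # rational opponent maximizes v
    return p * rational + (1 - p) * u.min()

grid = [np.array([q, 1.0 - q]) for q in np.linspace(0.0, 1.0, 101)]
best = max(grid, key=value)
print(f"approximately safest strategy for player 1: {best.round(2)}")
```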
arXiv Detail & Related papers (2022-01-12T01:45:51Z)
- Multi-Leader Congestion Games with an Adversary [0.5914780964919123]
We study a multi-leader single-follower congestion game where multiple users (leaders) choose one resource out of a set of resources.
We show that pure Nash equilibria may fail to exist and therefore, we consider approximate equilibria instead.
We show how to efficiently compute a best approximate equilibrium, that is, one with the smallest possible $\alpha$ among all $\alpha$-approximate equilibria of the given instance.
arXiv Detail & Related papers (2021-12-14T14:47:43Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.