Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games
- URL: http://arxiv.org/abs/2411.06069v1
- Date: Sat, 09 Nov 2024 05:03:10 GMT
- Title: Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games
- Authors: Alireza Masoumian, James R. Wright,
- Abstract summary: We propose an online model selection algorithm for the average reward RL setting.
We show that the additional cost of model selection scales only linearly in $M$, the number of model classes.
We also show that the exponential dependency on $m*$ is inevitable by proving a lower bound on the learner's regret.
- Score: 1.9161424760743742
- License:
- Abstract: In standard RL, a learner attempts to learn an optimal policy for a Markov Decision Process whose structure (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of $M >1$ model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose $\mathsf{MRBEAR}$, an online model selection algorithm for the average reward RL setting. The regret of the algorithm is in $\tilde O(M C_{m^*}^2 \mathsf{B}_{m^*}(T,\delta))$ where $C_{m^*}$ represents the complexity of the simplest well-specified model class and $\mathsf{B}_{m^*}(T,\delta)$ is its corresponding regret bound. This result shows that in average reward RL, like the episodic online RL, the additional cost of model selection scales only linearly in $M$, the number of model classes. We apply $\mathsf{MRBEAR}$ to the interaction between a learner and an opponent in a two-player simultaneous general-sum repeated game, where the opponent follows a fixed unknown limited memory strategy. The learner's goal is to maximize its utility without knowing the opponent's utility function. The interaction is over $T$ rounds with no episode or discounting which leads us to measure the learner's performance by average reward regret. In this application, our algorithm enjoys an opponent-complexity-dependent regret in $\tilde O(M(\mathsf{sp}(h^*) B^{m^*} A^{m^*+1})^{\frac{3}{2}} \sqrt{T})$, where $m^*\le M$ is the unknown memory limit of the opponent, $\mathsf{sp}(h^*)$ is the unknown span of optimal bias induced by the opponent, and $A$ and $B$ are the number of actions for the learner and opponent respectively. We also show that the exponential dependency on $m^*$ is inevitable by proving a lower bound on the learner's regret.
Related papers
- Online Learning of Halfspaces with Massart Noise [47.71073318490341]
We study the task of online learning in the presence of Massart noise.
We present a computationally efficient algorithm that achieves mistake bound $eta T + o(T)$.
We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k) Delta T - o(T)$ bigger than choosing a random action at every round.
arXiv Detail & Related papers (2024-05-21T17:31:10Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs)
In the simulator setting, we propose a model-free RL algorithm that finds an $epsilon$-optimal policy using $widetildeO left(fracSAmathrmsp(h*)epsilon2+fracS2Amathrmsp(h*)epsilon2right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $tildeObig((d+log (|mathcalS|2 |mathcalA|))sqrtKbig)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|mathcalS|$ and $|mathcalA|$ are the cardinalities of the state and action spaces
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $varepsilon$-approximate Markov NE policy using $$ widetildeObigg.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $widetildemathcalO((XA+YB)/varepsilon2)$ episodes of play to find an $varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $widetildemathcalO((X2A+Y2B)/varepsilon2)$ by a factor of $widetildemathcalO(maxX,
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - Fast Rates for Nonparametric Online Learning: From Realizability to
Learning in Games [36.969021834291745]
We propose a proper learning algorithm which gets a near-optimal mistake bound in terms of the sequential fat-shattering dimension of the hypothesis class.
This result answers a question as to whether proper learners could achieve near-optimal mistake bounds.
For the real-valued (regression) setting, the optimal mistake bound was not even known for improper learners.
arXiv Detail & Related papers (2021-11-17T05:24:21Z) - Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes [27.361399036211694]
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $mathcalP*$, we are given $M$ nested families of transition kernels.
We show that textttARL-GEN obtains a regret of $TildemathcalO(d_mathcalE*H2+sqrtd_mathcalE* mathbbM* H2 T)$
arXiv Detail & Related papers (2021-07-13T05:00:38Z) - Near-optimal Representation Learning for Linear Bandits and Linear RL [41.33483293243257]
We first consider the setting where we play $M$ linear bandits with dimension $d$ concurrently.
These bandits share a common $k$-dimensional linear representation so that $kll d$ and $k ll M$.
We propose a sample-efficient algorithm, MTLR-OFUL, which leverages the shared representation to achieve $tildeO(MsqrtdkT + dsqrtkMT )$ regret.
arXiv Detail & Related papers (2021-02-08T11:11:53Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision.
We propose an optimistic variant of the emphNash Q-learning algorithm with sample complexity $tildemathcalO(SAB)$, and a new emphNash V-learning algorithm with sample complexity $tildemathcalO(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Model-Based Reinforcement Learning with Value-Targeted Regression [48.92439657407732]
We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $mathcalP$.
We derive a bound on the regret, which, in the special case of linear mixtures, the regret bound takes the form $tildemathcalO(dsqrtH3T)$.
arXiv Detail & Related papers (2020-06-01T17:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.