Refining Minimax Regret for Unsupervised Environment Design
- URL: http://arxiv.org/abs/2402.12284v2
- Date: Sat, 8 Jun 2024 10:08:25 GMT
- Title: Refining Minimax Regret for Unsupervised Environment Design
- Authors: Michael Beukman, Samuel Coward, Michael Matthews, Mattie Fellows, Minqi Jiang, Michael Dennis, Jakob Foerster
- Abstract summary: We introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective.
We show that BLP policies act consistently with a Perfect Bayesian policy over all levels.
We also introduce an algorithm, ReMiDi, that results in a BLP policy at convergence.
- Score: 15.281908507614512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.
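As a concrete point of reference (standard UED notation, assumed here rather than quoted from the paper), the regret of a policy \pi on a level \theta and the minimax regret objective can be written as

  \mathrm{Regret}_\theta(\pi) = \max_{\pi'} V_\theta(\pi') - V_\theta(\pi), \qquad \pi_{\mathrm{MMR}} \in \arg\min_{\pi} \max_{\theta \in \Theta} \mathrm{Regret}_\theta(\pi),

where V_\theta(\pi) is the expected return of \pi on level \theta. The stagnation described above occurs once \max_\theta \mathrm{Regret}_\theta(\pi) reaches its floor: the adversary keeps sampling the regret-maximising levels, even though \pi could still improve on levels that do not attain the maximum regret.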
Related papers
- Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality [33.907054045921306]
We study agents acting in an unknown environment, where the goal is to find a robust policy.
We consider agents minimizing the maximum regret over different environment parameters, leading to the study of minimax regret.
This research focuses on deriving information-theoretic bounds for minimax regret in Markov Decision Processes.
arXiv Detail & Related papers (2024-10-21T13:45:02Z)
- Multi-Agent Imitation Learning: Value is Easy, Regret is Hard [52.31989962031179]
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
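For intuition, schematic definitions under standard notation (not necessarily the paper's exact formulation): the value gap compares returns under the learner's and the expert's joint policies, J(\sigma_E) - J(\sigma), whereas a regret-style criterion additionally bounds what any single agent could gain by unilaterally deviating from the learner's recommendation, \max_i \max_{\pi_i'} \big( J_i(\pi_i', \sigma_{-i}) - J_i(\sigma) \big); only the latter remains meaningful when agents are strategic.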
arXiv Detail & Related papers (2024-06-06T16:18:20Z)
- Do LLM Agents Have Regret? A Case Study in Online Learning and Games [30.377709765198592]
Large language models (LLMs) have been increasingly employed for (interactive) decision-making.
We study their interactions in benchmark decision-making settings in online learning and game theory.
We propose a novel unsupervised training loss, regret-loss, which, in contrast to the supervised pre-training loss, does not require labels of (supervised) actions.
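For reference, the (external) regret typically studied in such online-learning settings, with per-round losses \ell_t and actions a_t (a standard definition, not specific to this paper), is

  \mathrm{Reg}_T = \sum_{t=1}^{T} \ell_t(a_t) - \min_{a} \sum_{t=1}^{T} \ell_t(a),

and an agent is called no-regret when \mathrm{Reg}_T grows sublinearly in T.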
arXiv Detail & Related papers (2024-03-25T15:04:11Z)
- Emergency action termination for immediate reaction in hierarchical reinforcement learning [8.637919344171255]
We propose a method in which the validity of higher-level actions (thus lower-level goals) is constantly verified at the higher level.
If these actions, i.e., the lower-level goals, become inadequate, they are replaced by more appropriate ones.
In this way we combine the advantages of hierarchical RL (fast training) with those of flat RL (immediate reactivity).
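A minimal sketch of the control loop described above; all names (env, high_level_policy, low_level_policy, is_goal_still_valid) are illustrative assumptions, not the paper's implementation:

def hierarchical_control_loop(env, high_level_policy, low_level_policy, is_goal_still_valid):
    # Hierarchical control in which the current subgoal is re-verified at every step.
    obs = env.reset()
    subgoal = high_level_policy(obs)  # higher level proposes a lower-level goal
    done = False
    while not done:
        # Re-check the subgoal each step; if it has become inadequate,
        # replace it immediately instead of waiting for the option to finish.
        if not is_goal_still_valid(obs, subgoal):
            subgoal = high_level_policy(obs)
        action = low_level_policy(obs, subgoal)
        obs, reward, done, info = env.step(action)
    return obs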
arXiv Detail & Related papers (2022-11-11T16:56:02Z)
- Regret Minimization and Convergence to Equilibria in General-sum Markov Games [57.568118148036376]
We present the first algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents.
Our algorithm is decentralized, computationally efficient, and does not require any communication between agents.
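For reference, the per-agent regret that such guarantees control is typically of the form (standard notation, assumed rather than quoted)

  \mathrm{Reg}_i(T) = \max_{\pi_i'} \sum_{t=1}^{T} \big( V_i^{\pi_i', \pi_{-i}^{t}} - V_i^{\pi^{t}} \big),

and requiring \mathrm{Reg}_i(T) = o(T) simultaneously for every agent i is what "sublinear regret guarantees when executed by all agents" refers to.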
arXiv Detail & Related papers (2022-07-28T16:27:59Z)
- Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
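For context (standard definitions, not the paper's exact statements): the gap of a state-action pair is \Delta(s, a) = V^*(s) - Q^*(s, a), and gap-dependent regret bounds for optimistic algorithms typically replace the worst-case \sqrt{T} dependence with sums of \log T / \Delta(s, a) over the positive-gap pairs, so that easier instances (larger gaps) enjoy smaller regret.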
arXiv Detail & Related papers (2021-07-02T20:36:05Z)
- Robust Reinforcement Learning Under Minimax Regret for Green Security [50.03819244940543]
Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers.
We focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature.
We formulate the problem as a game between the defender and nature, where nature controls the parameter values of the adversarial behavior, and design an algorithm, MIRROR, to find a robust policy.
arXiv Detail & Related papers (2021-06-15T20:11:12Z)
- Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more than the unbiasedness of the risk estimator in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
First, we show that the simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
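As an illustration of controlling overestimation bias, the sketch below uses a TD3-style clipped double-Q target; it is a generic example under assumed names (target_actor, target_q1, target_q2), not a reproduction of the paper's exact recipe:

import torch

def clipped_double_q_target(reward, not_done, next_state,
                            target_actor, target_q1, target_q2, gamma=0.99):
    # Take the minimum of two target critics so the actor cannot exploit
    # positive estimation errors in either one (curbs overestimation bias).
    with torch.no_grad():
        next_action = target_actor(next_state)
        q_min = torch.min(target_q1(next_state, next_action),
                          target_q2(next_state, next_action))
        return reward + not_done * gamma * q_min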
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
- Learning from Label Proportions: A Mutual Contamination Framework [19.772652254660674]
Learning from label proportions (LLP) is a weakly supervised setting for classification in which unlabeled training instances are grouped into bags, and each bag is annotated with the proportion of each class occurring in that bag.
In this work we address these two issues by posing LLP in terms of mutual contamination models (MCMs), which have recently been applied successfully to study various other weak supervision settings.
In the process, we establish several novel technical results for MCMs, including unbiased losses and generalization error bounds under non-iid sampling plans.
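As background only, a generic proportion-matching surrogate for LLP (a common baseline, not the MCM-based losses introduced in the paper): with bags B_j, annotated proportions \hat{\gamma}_j, and a model p_\theta(\cdot \mid x), one penalises

  L(\theta) = \sum_j D\Big( \hat{\gamma}_j \,\Big\|\, \tfrac{1}{|B_j|} \sum_{x \in B_j} p_\theta(\cdot \mid x) \Big)

for some divergence D between the annotated and the average predicted class proportions.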
arXiv Detail & Related papers (2020-06-12T17:11:40Z)