School of hard knocks: Curriculum analysis for Pommerman with a fixed
computational budget
- URL: http://arxiv.org/abs/2102.11762v2
- Date: Wed, 24 Feb 2021 07:54:32 GMT
- Title: School of hard knocks: Curriculum analysis for Pommerman with a fixed
computational budget
- Authors: Omkar Shelke, Hardik Meisheri, Harshad Khadilkar
- Abstract summary: Pommerman is a hybrid cooperative/adversarial multi-agent environment.
This makes it a challenging environment for reinforcement learning approaches.
We develop a curriculum for learning a robust and promising policy in a constrained computational budget of 100,000 games.
- Score: 4.726777092009554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pommerman is a hybrid cooperative/adversarial multi-agent environment, with
challenging characteristics in terms of partial observability, limited or no
communication, sparse and delayed rewards, and restrictive computational time
limits. This makes it a challenging environment for reinforcement learning (RL)
approaches. In this paper, we focus on developing a curriculum for learning a
robust and promising policy in a constrained computational budget of 100,000
games, starting from a fixed base policy (which is itself trained to imitate a
noisy expert policy). All RL algorithms starting from the base policy use
vanilla proximal-policy optimization (PPO) with the same reward function, and
the only difference between their training is the mix and sequence of opponent
policies. One expects that beginning training with simpler opponents and then
gradually increasing the opponent difficulty will facilitate faster learning,
leading to more robust policies than a baseline where all available
opponent policies are introduced from the start. We test this hypothesis and
show that within constrained computational budgets, it is in fact better to
"learn in the school of hard knocks", i.e., against all available opponent
policies nearly from the start. We also include ablation studies where we study
the effect of modifying the base environment properties of ammo and bomb blast
strength on the agent performance.
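The setup described in the abstract is essentially a comparison of opponent schedules under a fixed budget of 100,000 games. The sketch below is not the authors' code; it only illustrates the two schedules being compared: a staged curriculum that gradually unlocks harder opponents, versus the "school of hard knocks" schedule that samples from the full opponent pool from the start. The opponent names and the play_game and ppo_update stubs are hypothetical stand-ins for Pommerman and a real PPO implementation.

```python
# Minimal sketch (not the authors' code): curriculum vs. full-pool opponent schedules
# under a fixed budget. Environment, PPO update, and opponents are hypothetical stubs.
import random

TOTAL_GAMES = 100_000  # fixed computational budget from the paper

# Hypothetical opponent pool, ordered roughly by difficulty.
OPPONENTS = ["static", "random", "simple_rule_based", "strong_rule_based"]

def curriculum_pool(games_played):
    """Staged curriculum: unlock harder opponents as training progresses."""
    stage = min(games_played * len(OPPONENTS) // TOTAL_GAMES, len(OPPONENTS) - 1)
    return OPPONENTS[: stage + 1]

def hard_knocks_pool(games_played):
    """'School of hard knocks': all opponents available from the start."""
    return OPPONENTS

def play_game(agent, opponent):
    """Stub: roll out one game against the chosen opponent and record the result."""
    return {"opponent": opponent, "won": random.random() < 0.5}

def ppo_update(agent, trajectory):
    """Stub: one vanilla PPO update with the same fixed reward function."""
    return agent

def train(opponent_schedule, agent=None):
    wins = 0
    for game in range(TOTAL_GAMES):
        opponent = random.choice(opponent_schedule(game))
        trajectory = play_game(agent, opponent)
        agent = ppo_update(agent, trajectory)
        wins += trajectory["won"]
    return agent, wins / TOTAL_GAMES

if __name__ == "__main__":
    _, curriculum_winrate = train(curriculum_pool)
    _, hard_knocks_winrate = train(hard_knocks_pool)
    print(f"curriculum: {curriculum_winrate:.3f}, hard knocks: {hard_knocks_winrate:.3f}")
```

Per the paper's finding, it is the hard_knocks_pool style of schedule, with all opponents introduced nearly from the start, that yields the more robust policy within the constrained budget.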
Related papers
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (a minimal illustration of max-following appears after this list).
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Belief-Enriched Pessimistic Q-Learning against Adversarial State
Perturbations [5.076419064097735]
Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage.
Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternately train the agent's policy and the attacker's policy.
We propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states.
arXiv Detail & Related papers (2024-03-06T20:52:49Z) - Beyond Worst-case Attacks: Robust RL with Adaptive Defense via
Non-dominated Policies [42.709038827974375]
We study policy robustness under the well-accepted state-adversarial attack model.
We propose a novel training-time algorithm to iteratively discover non-dominated policies.
Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance.
arXiv Detail & Related papers (2024-02-20T02:45:20Z) - Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z) - Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z) - A State-Distribution Matching Approach to Non-Episodic Reinforcement
Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z) - Modeling Strong and Human-Like Gameplay with KL-Regularized Search [64.24339197581769]
We consider the task of building strong but human-like policies in multi-agent decision-making problems.
Imitation learning is effective at predicting human actions but may not match the strength of expert humans.
We show in chess and Go that applying Monte Carlo tree search regularized by the KL divergence from an imitation-learned policy produces policies that have higher human prediction accuracy and are stronger than the imitation policy.
arXiv Detail & Related papers (2021-12-14T16:52:49Z) - Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z) - Independent Policy Gradient Methods for Competitive Reinforcement
Learning [62.91197073795261]
We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents.
We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
arXiv Detail & Related papers (2021-01-11T23:20:42Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from replay buffer and update policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
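The max-following policy referenced in the Oracle-Efficient Reinforcement Learning entry above can be made concrete with a short sketch: given access to constituent policies and per-policy value estimates, it acts at each state according to whichever constituent has the highest estimated value there. The policies and value functions below are toy stand-ins, not anything from that paper.

```python
# Minimal sketch of a max-following policy (illustration only): at each state,
# act according to whichever constituent policy has the highest estimated value.
# The constituent policies and value functions here are hypothetical stubs.
from typing import Callable, List, Tuple

Policy = Callable[[int], int]      # state -> action
ValueFn = Callable[[int], float]   # state -> estimated value of following that policy

def max_following(constituents: List[Tuple[Policy, ValueFn]]) -> Policy:
    def policy(state: int) -> int:
        # Pick the constituent whose value estimate is highest at this state,
        # then follow its action.
        best_policy, _ = max(constituents, key=lambda pv: pv[1](state))
        return best_policy(state)
    return policy

if __name__ == "__main__":
    # Two toy constituents: one is valuable in even states, the other in odd states.
    even = (lambda s: 0, lambda s: 1.0 if s % 2 == 0 else 0.0)
    odd = (lambda s: 1, lambda s: 1.0 if s % 2 == 1 else 0.0)
    pi = max_following([even, odd])
    print([pi(s) for s in range(5)])  # -> [0, 1, 0, 1, 0]
```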