School of hard knocks: Curriculum analysis for Pommerman with a fixed
computational budget
- URL: http://arxiv.org/abs/2102.11762v2
- Date: Wed, 24 Feb 2021 07:54:32 GMT
- Title: School of hard knocks: Curriculum analysis for Pommerman with a fixed
computational budget
- Authors: Omkar Shelke, Hardik Meisheri, Harshad Khadilkar
- Abstract summary: Pommerman is a hybrid cooperative/adversarial multi-agent environment.
This makes it a challenging environment for reinforcement learning approaches.
We develop a curriculum for learning a robust and promising policy in a constrained computational budget of 100,000 games.
- Score: 4.726777092009554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pommerman is a hybrid cooperative/adversarial multi-agent environment, with
challenging characteristics in terms of partial observability, limited or no
communication, sparse and delayed rewards, and restrictive computational time
limits. This makes it a challenging environment for reinforcement learning (RL)
approaches. In this paper, we focus on developing a curriculum for learning a
robust and promising policy in a constrained computational budget of 100,000
games, starting from a fixed base policy (which is itself trained to imitate a
noisy expert policy). All RL algorithms starting from the base policy use
vanilla proximal-policy optimization (PPO) with the same reward function, and
the only difference between their training is the mix and sequence of opponent
policies. One expects that beginning training with simpler opponents and then
gradually increasing the opponent difficulty will facilitate faster learning,
leading to more robust policies than a baseline where all available
opponent policies are introduced from the start. We test this hypothesis and
show that within constrained computational budgets, it is in fact better to
"learn in the school of hard knocks", i.e., against all available opponent
policies nearly from the start. We also include ablation studies where we study
the effect of modifying the base environment properties of ammo and bomb blast
strength on the agent performance.
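The setup described in the abstract is essentially a comparison of opponent schedules under a fixed budget of 100,000 games. The sketch below is not the authors' code; it only illustrates the two schedules being compared: a staged curriculum that gradually unlocks harder opponents, versus the "school of hard knocks" schedule that samples from the full opponent pool from the start. The opponent names and the play_game and ppo_update stubs are hypothetical stand-ins for Pommerman and a real PPO implementation.

```python
# Minimal sketch (not the authors' code): curriculum vs. full-pool opponent schedules
# under a fixed budget. Environment, PPO update, and opponents are hypothetical stubs.
import random

TOTAL_GAMES = 100_000  # fixed computational budget from the paper

# Hypothetical opponent pool, ordered roughly by difficulty.
OPPONENTS = ["static", "random", "simple_rule_based", "strong_rule_based"]

def curriculum_pool(games_played):
    """Staged curriculum: unlock harder opponents as training progresses."""
    stage = min(games_played * len(OPPONENTS) // TOTAL_GAMES, len(OPPONENTS) - 1)
    return OPPONENTS[: stage + 1]

def hard_knocks_pool(games_played):
    """'School of hard knocks': all opponents available from the start."""
    return OPPONENTS

def play_game(agent, opponent):
    """Stub: roll out one game against the chosen opponent and record the result."""
    return {"opponent": opponent, "won": random.random() < 0.5}

def ppo_update(agent, trajectory):
    """Stub: one vanilla PPO update with the same fixed reward function."""
    return agent

def train(opponent_schedule, agent=None):
    wins = 0
    for game in range(TOTAL_GAMES):
        opponent = random.choice(opponent_schedule(game))
        trajectory = play_game(agent, opponent)
        agent = ppo_update(agent, trajectory)
        wins += trajectory["won"]
    return agent, wins / TOTAL_GAMES

if __name__ == "__main__":
    _, curriculum_winrate = train(curriculum_pool)
    _, hard_knocks_winrate = train(hard_knocks_pool)
    print(f"curriculum: {curriculum_winrate:.3f}, hard knocks: {hard_knocks_winrate:.3f}")
```

Per the paper's finding, it is the hard_knocks_pool style of schedule, with all opponents introduced nearly from the start, that yields the more robust policy within the constrained budget.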
Related papers
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (a minimal illustration of max-following appears after this list).
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Belief-Enriched Pessimistic Q-Learning against Adversarial State
Perturbations [5.076419064097735]
Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage.
Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternately train the agent's policy and the attacker's policy.
We propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states.
arXiv Detail & Related papers (2024-03-06T20:52:49Z) - Beyond Worst-case Attacks: Robust RL with Adaptive Defense via
Non-dominated Policies [42.709038827974375]
We study policy robustness under the well-accepted state-adversarial attack model.
We propose a novel training-time algorithm to iteratively discover non-dominated policies.
Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance.
arXiv Detail & Related papers (2024-02-20T02:45:20Z) - Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z) - Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z) - A State-Distribution Matching Approach to Non-Episodic Reinforcement
Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z) - Modeling Strong and Human-Like Gameplay with KL-Regularized Search [64.24339197581769]
We consider the task of building strong but human-like policies in multi-agent decision-making problems.
Imitation learning is effective at predicting human actions but may not match the strength of expert humans.
We show in chess and Go that applying Monte Carlo tree search regularized by the KL divergence from an imitation-learned policy produces policies that have higher human prediction accuracy and are stronger than the imitation policy.
arXiv Detail & Related papers (2021-12-14T16:52:49Z) - Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z) - Independent Policy Gradient Methods for Competitive Reinforcement
Learning [62.91197073795261]
We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents.
We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
arXiv Detail & Related papers (2021-01-11T23:20:42Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from replay buffer and update policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
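The max-following policy referenced in the Oracle-Efficient Reinforcement Learning entry above can be made concrete with a short sketch: given access to constituent policies and per-policy value estimates, it acts at each state according to whichever constituent has the highest estimated value there. The policies and value functions below are toy stand-ins, not anything from that paper.

```python
# Minimal sketch of a max-following policy (illustration only): at each state,
# act according to whichever constituent policy has the highest estimated value.
# The constituent policies and value functions here are hypothetical stubs.
from typing import Callable, List, Tuple

Policy = Callable[[int], int]      # state -> action
ValueFn = Callable[[int], float]   # state -> estimated value of following that policy

def max_following(constituents: List[Tuple[Policy, ValueFn]]) -> Policy:
    def policy(state: int) -> int:
        # Pick the constituent whose value estimate is highest at this state,
        # then follow its action.
        best_policy, _ = max(constituents, key=lambda pv: pv[1](state))
        return best_policy(state)
    return policy

if __name__ == "__main__":
    # Two toy constituents: one is valuable in even states, the other in odd states.
    even = (lambda s: 0, lambda s: 1.0 if s % 2 == 0 else 0.0)
    odd = (lambda s: 1, lambda s: 1.0 if s % 2 == 1 else 0.0)
    pi = max_following([even, odd])
    print([pi(s) for s in range(5)])  # -> [0, 1, 0, 1, 0]
```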