Generalization, Mayhems and Limits in Recurrent Proximal Policy
Optimization
- URL: http://arxiv.org/abs/2205.11104v1
- Date: Mon, 23 May 2022 07:54:15 GMT
- Title: Generalization, Mayhems and Limits in Recurrent Proximal Policy
Optimization
- Authors: Marco Pleines, Matthias Pallasch, Frank Zimmer, Mike Preuss
- Abstract summary: We highlight vital details that one must get right when adding recurrence to achieve a correct and efficient implementation.
We explore the limitations of recurrent PPO by benchmarking the contributed novel environments Mortar Mayhem and Searing Spotlights.
Remarkably, we can demonstrate a transition to strong generalization in Mortar Mayhem when scaling the number of training seeds.
- Score: 1.8570591025615453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: At first sight it may seem straightforward to use recurrent layers in Deep
Reinforcement Learning algorithms to enable agents to make use of memory in the
setting of partially observable environments. Starting from widely used
Proximal Policy Optimization (PPO), we highlight vital details that one must
get right when adding recurrence to achieve a correct and efficient
implementation, namely: properly shaping the neural net's forward pass,
arranging the training data, correspondingly selecting hidden states for
sequence beginnings and masking paddings for loss computation. We further
explore the limitations of recurrent PPO by benchmarking the contributed novel
environments Mortar Mayhem and Searing Spotlights that challenge the agent's
memory beyond solely capacity and distraction tasks. Remarkably, we can
demonstrate a transition to strong generalization in Mortar Mayhem when scaling
the number of training seeds, while the agent does not succeed on Searing
Spotlights, which seems to be a tough challenge for memory-based agents.
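To make the listed pitfalls concrete, here is a minimal, hypothetical PyTorch sketch of two of them: arranging training data into fixed-length padded sequences and masking the padding in the loss. All function and variable names are ours, not the authors' reference implementation.

```python
import torch

def split_and_pad(episodes, seq_len):
    """Arrange training data: split episodes of unequal length into
    fixed-length sequences, zero-padding the last chunk of each episode.
    Returns the padded sequences and a mask (1 = real step, 0 = padding)."""
    seqs, masks = [], []
    for ep in episodes:  # ep: tensor of shape (T, feature_dim)
        for start in range(0, ep.shape[0], seq_len):
            chunk = ep[start:start + seq_len]
            pad = seq_len - chunk.shape[0]
            masks.append(torch.cat([torch.ones(chunk.shape[0]), torch.zeros(pad)]))
            if pad > 0:
                chunk = torch.cat([chunk, torch.zeros(pad, chunk.shape[1])])
            seqs.append(chunk)
    return torch.stack(seqs), torch.stack(masks)

def masked_mean(loss, mask):
    """Mask paddings for loss computation: padded steps must not contribute,
    so normalize by the number of real steps, not the full tensor size."""
    return (loss * mask).sum() / mask.sum()

# Example: two episodes of lengths 5 and 3, split into sequences of length 4.
episodes = [torch.randn(5, 8), torch.randn(3, 8)]
seqs, mask = split_and_pad(episodes, seq_len=4)   # seqs: (3, 4, 8), mask: (3, 4)
per_step_loss = seqs.pow(2).mean(dim=-1)          # stand-in for the PPO loss
print(masked_mean(per_step_loss, mask))
```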
Related papers
- PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead [24.611413814466978]
Large language models (LLMs) enhanced with retrieval-augmented generation (RAG) have introduced a new paradigm for web search.
Existing methods to enhance context awareness are often inefficient, incurring time or memory overhead during inference.
We propose Position-Embedding-Agnostic attention Re-weighting (PEAR) which enhances the context awareness of LLMs with zero inference overhead.
arXiv Detail & Related papers (2024-09-29T15:40:54Z)
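PEAR's exact training recipe is not given in the summary; as a hedged sketch of why head re-weighting can cost nothing at inference, the snippet below folds learned per-head scalars into the attention output projection offline. All names, shapes, and values are illustrative, not PEAR's actual procedure.

```python
import torch

# Hypothetical illustration: re-weight attention heads with per-head scalars w.
# Because each head's output enters the layer linearly through the output
# projection W_o, the scalars can be folded into W_o's columns offline,
# so the re-weighted model runs with zero extra inference cost.
num_heads, head_dim, model_dim = 8, 64, 512
W_o = torch.randn(model_dim, num_heads * head_dim)   # output projection
w = torch.rand(num_heads)                            # per-head weights (random stand-ins)

# Fold: scale the block of W_o columns belonging to each head by its weight.
W_o_folded = W_o.clone()
for h in range(num_heads):
    W_o_folded[:, h * head_dim:(h + 1) * head_dim] *= w[h]

# Check: folding is equivalent to scaling the concatenated head outputs.
heads = torch.randn(num_heads * head_dim)            # concatenated head outputs
scaled = heads.reshape(num_heads, head_dim) * w[:, None]
assert torch.allclose(W_o_folded @ heads, W_o @ scaled.reshape(-1), atol=1e-4)
```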
- Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning.
We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging.
We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
arXiv Detail & Related papers (2024-08-18T14:25:44Z)
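As a toy, hedged illustration of casting an automaton as a Markov reward process (our construction, not necessarily the paper's): treat the automaton's transitions under the current policy as a Markov chain and compute each state's probability of eventually reaching the accepting sink. States closer to acceptance get higher values, giving a dense signal where the raw LTL reward is sparse.

```python
# Toy 4-state automaton: q2 is an accepting sink, q3 a rejecting sink.
# P[q][q'] is the assumed probability of moving from q to q' under the
# agent's current behavior (illustrative numbers).
P = [
    [0.7, 0.2, 0.0, 0.1],  # q0: self-loop, progress, -, violate
    [0.1, 0.6, 0.2, 0.1],  # q1: regress, self-loop, accept, violate
    [0.0, 0.0, 1.0, 0.0],  # q2: accepting sink
    [0.0, 0.0, 0.0, 1.0],  # q3: rejecting sink
]

v = [0.0, 0.0, 1.0, 0.0]   # value = probability of reaching q2
for _ in range(200):       # value iteration to (near) convergence
    v = [sum(p * x for p, x in zip(P[q], v)) for q in range(2)] + [1.0, 0.0]

print(v)  # approx [0.4, 0.6, 1.0, 0.0]; higher = closer to acceptance
```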
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback [10.957528713294874]
Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL).
We give the first near-optimal regret bounds for PO in tabular MDPs with delayed feedback; these bounds may even surpass the state of the art (which uses less efficient methods).
Our novel Delay-Adapted PO (DAPO) is easy to implement and to generalize, allowing us to extend our algorithm to infinite state spaces under the assumption of a linear $Q$-function, proving the first regret bounds for delayed feedback with function approximation.
arXiv Detail & Related papers (2023-05-13T12:40:28Z)
- Understanding and Preventing Capacity Loss in Reinforcement Learning [28.52122927103544]
We identify a mechanism by which non-stationary prediction targets can prevent learning progress in deep RL agents.
Capacity loss occurs in a range of RL agents and environments, and is particularly damaging to performance in sparse-reward tasks.
arXiv Detail & Related papers (2022-04-20T15:55:15Z)
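The summary names the phenomenon but not a diagnostic. One hedged, illustrative probe (our choice, not necessarily the paper's) tracks the effective rank of a batch of features, which shrinks as the representation collapses and the network loses capacity.

```python
import torch

def effective_rank(features, threshold=0.99):
    """Illustrative capacity probe: the number of singular directions needed
    to capture `threshold` of the feature matrix's energy (squared spectral
    mass). A shrinking value over training suggests representation collapse."""
    s2 = torch.linalg.svdvals(features) ** 2     # features: (batch, feature_dim)
    cumulative = torch.cumsum(s2, dim=0) / s2.sum()
    return int((cumulative < threshold).sum()) + 1

# Example: healthy full-rank features vs. nearly rank-1 (collapsed) features.
healthy = torch.randn(256, 64)
collapsed = torch.randn(256, 1) @ torch.randn(1, 64) + 0.01 * torch.randn(256, 64)
print(effective_rank(healthy), effective_rank(collapsed))  # e.g. ~60 vs. 1
```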
- APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
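Nonparametric entropy is commonly estimated with particles. As a hedged sketch (our choice of estimator and hyperparameters, not APS's exact objective), the intrinsic reward below is the log of each state embedding's k-th nearest-neighbor distance within the batch, so spread-out embeddings earn more reward.

```python
import torch

def knn_entropy_reward(embeddings, k=12):
    """Particle-based (nonparametric) entropy sketch: reward each state
    embedding by the distance to its k-th nearest neighbor in the batch.
    Up to constants, larger k-NN distances mean higher estimated entropy."""
    dists = torch.cdist(embeddings, embeddings)                # (N, N) pairwise
    knn_dist = dists.topk(k + 1, largest=False).values[:, -1]  # skip self (d=0)
    return torch.log(1.0 + knn_dist)                           # intrinsic reward

# Example: a batch of 256 state embeddings of dimension 32.
rewards = knn_entropy_reward(torch.randn(256, 32))
print(rewards.shape, rewards.mean())
```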
- Posterior Meta-Replay for Continual Learning [4.319932092720977]
Continual Learning (CL) algorithms have recently received a lot of attention as they attempt to overcome the need to train with an i.i.d. sample from some unknown target data distribution.
We study principled ways to tackle the CL problem by adopting a Bayesian perspective and focus on continually learning a task-specific posterior distribution.
arXiv Detail & Related papers (2021-03-01T17:08:35Z)
- Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z)
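As a hedged simplification of the pretraining idea, with illustrative dimensions and matrices of our own naming: a linear recurrence encodes the sequence into a memory state, and linear maps decode the inputs back out of that state in reverse order; after pretraining, the learned matrices can initialize an RNN's recurrent and input weights.

```python
import torch

class LinearSeqAutoencoder(torch.nn.Module):
    """Sketch of a linear autoencoder for sequences: encode with
    m_t = A m_{t-1} + B x_t, then decode the inputs in reverse order
    from the final state via x_hat = C m and a backward step m = D m."""
    def __init__(self, input_dim, memory_dim):
        super().__init__()
        self.memory_dim = memory_dim
        self.A = torch.nn.Linear(memory_dim, memory_dim, bias=False)
        self.B = torch.nn.Linear(input_dim, memory_dim, bias=False)
        self.C = torch.nn.Linear(memory_dim, input_dim, bias=False)
        self.D = torch.nn.Linear(memory_dim, memory_dim, bias=False)

    def forward(self, x):                       # x: (T, batch, input_dim)
        m = x.new_zeros(x.shape[1], self.memory_dim)
        for x_t in x:                           # linear encoding pass
            m = self.A(m) + self.B(x_t)
        recon = []
        for _ in range(x.shape[0]):             # decode last-to-first from m_T
            recon.append(self.C(m))
            m = self.D(m)
        return torch.stack(recon[::-1])         # align with input order

# Train to reconstruct, then reuse A and B to initialize an RNN's
# recurrent and input weight matrices (the pretraining step above).
model = LinearSeqAutoencoder(input_dim=8, memory_dim=32)
x = torch.randn(10, 4, 8)
loss = (model(x) - x).pow(2).mean()
loss.backward()
```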
- Rapid Structural Pruning of Neural Networks with Set-based Task-Adaptive Meta-Pruning [83.59005356327103]
A common limitation of most existing pruning techniques is that they require pre-training of the network at least once before pruning.
We propose STAMP, which task-adaptively prunes a network pretrained on a large reference dataset by generating a pruning mask on it as a function of the target dataset.
We validate STAMP against recent advanced pruning methods on benchmark datasets.
arXiv Detail & Related papers (2020-06-22T10:57:43Z)
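STAMP's actual mask generator is set-based and meta-learned; the sketch below is a deliberately simpler stand-in that derives a mask from gradient saliency on a few target-dataset batches, just to make "a pruning mask as a function of the target dataset" concrete. All names and the saliency criterion are our assumptions.

```python
import torch

def task_adaptive_mask(layer, target_batches, loss_fn, keep_ratio=0.5):
    """Hedged stand-in (not STAMP's meta-learned set encoder): score each
    output unit of a pretrained linear layer by accumulated gradient
    saliency |w * dL/dw| on target data, keeping the top fraction."""
    scores = torch.zeros(layer.out_features)
    for x, y in target_batches:
        layer.weight.grad = None                      # fresh gradient per batch
        loss_fn(layer(x), y).backward()
        scores += (layer.weight * layer.weight.grad).abs().sum(dim=1)
    k = int(keep_ratio * layer.out_features)
    mask = torch.zeros(layer.out_features)
    mask[scores.topk(k).indices] = 1.0                # 1 = keep unit, 0 = prune
    return mask

# Example: prune half the units of a "pretrained" layer using target data.
layer = torch.nn.Linear(16, 32)
batches = [(torch.randn(8, 16), torch.randn(8, 32)) for _ in range(4)]
mask = task_adaptive_mask(layer, batches, torch.nn.functional.mse_loss)
print(mask.sum())  # 16 kept units out of 32
```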
- MLE-guided parameter search for task loss minimization in neural sequence modeling [83.83249536279239]
Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
arXiv Detail & Related papers (2020-06-04T22:21:22Z)
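The mixture described above translates directly into code. The following is a hedged sketch with illustrative names, sign conventions, and hyperparameters, not the authors' implementation: candidate updates are drawn from a mixture of random search around the current parameters and random search around the MLE descent step, then scored with the sequence-level task loss.

```python
import torch

def mgs_step(params, mle_step, task_loss_fn, k=4, sigma=0.01, mix=0.5):
    """Sketch of MLE-guided parameter search: draw k candidate updates from
    a mixture of (a) random search around the current parameters and
    (b) random search around the MLE descent step, then keep the candidate
    with the lowest sequence-level task loss."""
    candidates = []
    for _ in range(k):
        noise = sigma * torch.randn_like(params)
        if torch.rand(()) < mix:
            direction = noise                 # component (a): pure random search
        else:
            direction = -mle_step + noise     # component (b): guided by MLE step
        candidates.append(params + direction)
    losses = [task_loss_fn(c) for c in candidates]
    return candidates[min(range(k), key=lambda i: losses[i])]

# Toy usage: minimize a quadratic "task loss" with a noisy "MLE gradient".
params = torch.zeros(10)
task_loss = lambda p: (p - 1.0).pow(2).sum()
for _ in range(200):
    mle_grad = 2.0 * (params - 1.0) + 0.1 * torch.randn(10)  # stand-in gradient
    params = mgs_step(params, 0.05 * mle_grad, task_loss)
print(task_loss(params))  # should be close to 0
```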