Generalization, Mayhems and Limits in Recurrent Proximal Policy
Optimization
- URL: http://arxiv.org/abs/2205.11104v1
- Date: Mon, 23 May 2022 07:54:15 GMT
- Title: Generalization, Mayhems and Limits in Recurrent Proximal Policy
Optimization
- Authors: Marco Pleines, Matthias Pallasch, Frank Zimmer, Mike Preuss
- Abstract summary: We highlight vital details that one must get right when adding recurrence to achieve a correct and efficient implementation.
We explore the limitations of recurrent PPO by benchmarking the contributed novel environments Mortar Mayhem and Searing Spotlights.
Remarkably, we can demonstrate a transition to strong generalization in Mortar Mayhem when scaling the number of training seeds.
- Score: 1.8570591025615453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: At first sight it may seem straightforward to use recurrent layers in Deep
Reinforcement Learning algorithms to enable agents to make use of memory in the
setting of partially observable environments. Starting from widely used
Proximal Policy Optimization (PPO), we highlight vital details that one must
get right when adding recurrence to achieve a correct and efficient
implementation, namely: properly shaping the neural net's forward pass,
arranging the training data, correspondingly selecting hidden states for
sequence beginnings and masking paddings for loss computation. We further
explore the limitations of recurrent PPO by benchmarking the contributed novel
environments Mortar Mayhem and Searing Spotlights that challenge the agent's
memory beyond solely capacity and distraction tasks. Remarkably, we can
demonstrate a transition to strong generalization in Mortar Mayhem when scaling
the number of training seeds, while the agent does not succeed on Searing
Spotlights, which seems to be a tough challenge for memory-based agents.
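To make the listed pitfalls concrete, here is a minimal, hypothetical PyTorch sketch of two of them: arranging training data into fixed-length padded sequences and masking the padding in the loss. All function and variable names are ours, not the authors' reference implementation.

```python
import torch

def split_and_pad(episodes, seq_len):
    """Arrange training data: split episodes of unequal length into
    fixed-length sequences, zero-padding the last chunk of each episode.
    Returns the padded sequences and a mask (1 = real step, 0 = padding)."""
    seqs, masks = [], []
    for ep in episodes:  # ep: tensor of shape (T, feature_dim)
        for start in range(0, ep.shape[0], seq_len):
            chunk = ep[start:start + seq_len]
            pad = seq_len - chunk.shape[0]
            masks.append(torch.cat([torch.ones(chunk.shape[0]), torch.zeros(pad)]))
            if pad > 0:
                chunk = torch.cat([chunk, torch.zeros(pad, chunk.shape[1])])
            seqs.append(chunk)
    return torch.stack(seqs), torch.stack(masks)

def masked_mean(loss, mask):
    """Mask paddings for loss computation: padded steps must not contribute,
    so normalize by the number of real steps, not the full tensor size."""
    return (loss * mask).sum() / mask.sum()

# Example: two episodes of lengths 5 and 3, split into sequences of length 4.
episodes = [torch.randn(5, 8), torch.randn(3, 8)]
seqs, mask = split_and_pad(episodes, seq_len=4)   # seqs: (3, 4, 8), mask: (3, 4)
per_step_loss = seqs.pow(2).mean(dim=-1)          # stand-in for the PPO loss
print(masked_mean(per_step_loss, mask))
```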
Related papers
- PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead [24.611413814466978]
Large language models (LLMs) enhanced with retrieval-augmented generation (RAG) have introduced a new paradigm for web search.
Existing methods to enhance context awareness are often inefficient, incurring time or memory overhead during inference.
We propose Position-Embedding-Agnostic attention Re-weighting (PEAR) which enhances the context awareness of LLMs with zero inference overhead.
arXiv Detail & Related papers (2024-09-29T15:40:54Z)
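PEAR's exact training recipe is not given in the summary; as a hedged sketch of why head re-weighting can cost nothing at inference, the snippet below folds learned per-head scalars into the attention output projection offline. All names, shapes, and values are illustrative, not PEAR's actual procedure.

```python
import torch

# Hypothetical illustration: re-weight attention heads with per-head scalars w.
# Because each head's output enters the layer linearly through the output
# projection W_o, the scalars can be folded into W_o's columns offline,
# so the re-weighted model runs with zero extra inference cost.
num_heads, head_dim, model_dim = 8, 64, 512
W_o = torch.randn(model_dim, num_heads * head_dim)   # output projection
w = torch.rand(num_heads)                            # per-head weights (random stand-ins)

# Fold: scale the block of W_o columns belonging to each head by its weight.
W_o_folded = W_o.clone()
for h in range(num_heads):
    W_o_folded[:, h * head_dim:(h + 1) * head_dim] *= w[h]

# Check: folding is equivalent to scaling the concatenated head outputs.
heads = torch.randn(num_heads * head_dim)            # concatenated head outputs
scaled = heads.reshape(num_heads, head_dim) * w[:, None]
assert torch.allclose(W_o_folded @ heads, W_o @ scaled.reshape(-1), atol=1e-4)
```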
- Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning.
We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging.
We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
arXiv Detail & Related papers (2024-08-18T14:25:44Z)
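As a toy, hedged illustration of casting an automaton as a Markov reward process (our construction, not necessarily the paper's): treat the automaton's transitions under the current policy as a Markov chain and compute each state's probability of eventually reaching the accepting sink. States closer to acceptance get higher values, giving a dense signal where the raw LTL reward is sparse.

```python
# Toy 4-state automaton: q2 is an accepting sink, q3 a rejecting sink.
# P[q][q'] is the assumed probability of moving from q to q' under the
# agent's current behavior (illustrative numbers).
P = [
    [0.7, 0.2, 0.0, 0.1],  # q0: self-loop, progress, -, violate
    [0.1, 0.6, 0.2, 0.1],  # q1: regress, self-loop, accept, violate
    [0.0, 0.0, 1.0, 0.0],  # q2: accepting sink
    [0.0, 0.0, 0.0, 1.0],  # q3: rejecting sink
]

v = [0.0, 0.0, 1.0, 0.0]   # value = probability of reaching q2
for _ in range(200):       # value iteration to (near) convergence
    v = [sum(p * x for p, x in zip(P[q], v)) for q in range(2)] + [1.0, 0.0]

print(v)  # approx [0.4, 0.6, 1.0, 0.0]; higher = closer to acceptance
```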
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback [10.957528713294874]
Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL).
We give the first near-optimal regret bounds for PO in tabular MDPs with delayed feedback; these bounds may even surpass the state of the art (which uses less efficient methods).
Our novel Delay-Adapted PO (DAPO) is easy to implement and to generalize, allowing us to extend our algorithm to infinite state spaces under the assumption of a linear $Q$-function, proving the first regret bounds for delayed feedback with function approximation.
arXiv Detail & Related papers (2023-05-13T12:40:28Z)
- Understanding and Preventing Capacity Loss in Reinforcement Learning [28.52122927103544]
We identify a mechanism by which non-stationary prediction targets can prevent learning progress in deep RL agents.
Capacity loss occurs in a range of RL agents and environments, and is particularly damaging to performance in sparse-reward tasks.
arXiv Detail & Related papers (2022-04-20T15:55:15Z)
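The summary names the phenomenon but not a diagnostic. One hedged, illustrative probe (our choice, not necessarily the paper's) tracks the effective rank of a batch of features, which shrinks as the representation collapses and the network loses capacity.

```python
import torch

def effective_rank(features, threshold=0.99):
    """Illustrative capacity probe: the number of singular directions needed
    to capture `threshold` of the feature matrix's energy (squared spectral
    mass). A shrinking value over training suggests representation collapse."""
    s2 = torch.linalg.svdvals(features) ** 2     # features: (batch, feature_dim)
    cumulative = torch.cumsum(s2, dim=0) / s2.sum()
    return int((cumulative < threshold).sum()) + 1

# Example: healthy full-rank features vs. nearly rank-1 (collapsed) features.
healthy = torch.randn(256, 64)
collapsed = torch.randn(256, 1) @ torch.randn(1, 64) + 0.01 * torch.randn(256, 64)
print(effective_rank(healthy), effective_rank(collapsed))  # e.g. ~60 vs. 1
```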
- APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
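Nonparametric entropy is commonly estimated with particles. As a hedged sketch (our choice of estimator and hyperparameters, not APS's exact objective), the intrinsic reward below is the log of each state embedding's k-th nearest-neighbor distance within the batch, so spread-out embeddings earn more reward.

```python
import torch

def knn_entropy_reward(embeddings, k=12):
    """Particle-based (nonparametric) entropy sketch: reward each state
    embedding by the distance to its k-th nearest neighbor in the batch.
    Up to constants, larger k-NN distances mean higher estimated entropy."""
    dists = torch.cdist(embeddings, embeddings)                # (N, N) pairwise
    knn_dist = dists.topk(k + 1, largest=False).values[:, -1]  # skip self (d=0)
    return torch.log(1.0 + knn_dist)                           # intrinsic reward

# Example: a batch of 256 state embeddings of dimension 32.
rewards = knn_entropy_reward(torch.randn(256, 32))
print(rewards.shape, rewards.mean())
```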
- Posterior Meta-Replay for Continual Learning [4.319932092720977]
Continual Learning (CL) algorithms have recently received a lot of attention as they attempt to overcome the need to train with an i.i.d. sample from some unknown target data distribution.
We study principled ways to tackle the CL problem by adopting a Bayesian perspective and focus on continually learning a task-specific posterior distribution.
arXiv Detail & Related papers (2021-03-01T17:08:35Z)
- Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z)
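As a hedged simplification of the pretraining idea, with illustrative dimensions and matrices of our own naming: a linear recurrence encodes the sequence into a memory state, and linear maps decode the inputs back out of that state in reverse order; after pretraining, the learned matrices can initialize an RNN's recurrent and input weights.

```python
import torch

class LinearSeqAutoencoder(torch.nn.Module):
    """Sketch of a linear autoencoder for sequences: encode with
    m_t = A m_{t-1} + B x_t, then decode the inputs in reverse order
    from the final state via x_hat = C m and a backward step m = D m."""
    def __init__(self, input_dim, memory_dim):
        super().__init__()
        self.memory_dim = memory_dim
        self.A = torch.nn.Linear(memory_dim, memory_dim, bias=False)
        self.B = torch.nn.Linear(input_dim, memory_dim, bias=False)
        self.C = torch.nn.Linear(memory_dim, input_dim, bias=False)
        self.D = torch.nn.Linear(memory_dim, memory_dim, bias=False)

    def forward(self, x):                       # x: (T, batch, input_dim)
        m = x.new_zeros(x.shape[1], self.memory_dim)
        for x_t in x:                           # linear encoding pass
            m = self.A(m) + self.B(x_t)
        recon = []
        for _ in range(x.shape[0]):             # decode last-to-first from m_T
            recon.append(self.C(m))
            m = self.D(m)
        return torch.stack(recon[::-1])         # align with input order

# Train to reconstruct, then reuse A and B to initialize an RNN's
# recurrent and input weight matrices (the pretraining step above).
model = LinearSeqAutoencoder(input_dim=8, memory_dim=32)
x = torch.randn(10, 4, 8)
loss = (model(x) - x).pow(2).mean()
loss.backward()
```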
- Rapid Structural Pruning of Neural Networks with Set-based Task-Adaptive Meta-Pruning [83.59005356327103]
A common limitation of most existing pruning techniques is that they require pre-training of the network at least once before pruning.
We propose STAMP, which task-adaptively prunes a network pretrained on a large reference dataset by generating a pruning mask on it as a function of the target dataset.
We validate STAMP against recent advanced pruning methods on benchmark datasets.
arXiv Detail & Related papers (2020-06-22T10:57:43Z)
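STAMP's actual mask generator is set-based and meta-learned; the sketch below is a deliberately simpler stand-in that derives a mask from gradient saliency on a few target-dataset batches, just to make "a pruning mask as a function of the target dataset" concrete. All names and the saliency criterion are our assumptions.

```python
import torch

def task_adaptive_mask(layer, target_batches, loss_fn, keep_ratio=0.5):
    """Hedged stand-in (not STAMP's meta-learned set encoder): score each
    output unit of a pretrained linear layer by accumulated gradient
    saliency |w * dL/dw| on target data, keeping the top fraction."""
    scores = torch.zeros(layer.out_features)
    for x, y in target_batches:
        layer.weight.grad = None                      # fresh gradient per batch
        loss_fn(layer(x), y).backward()
        scores += (layer.weight * layer.weight.grad).abs().sum(dim=1)
    k = int(keep_ratio * layer.out_features)
    mask = torch.zeros(layer.out_features)
    mask[scores.topk(k).indices] = 1.0                # 1 = keep unit, 0 = prune
    return mask

# Example: prune half the units of a "pretrained" layer using target data.
layer = torch.nn.Linear(16, 32)
batches = [(torch.randn(8, 16), torch.randn(8, 32)) for _ in range(4)]
mask = task_adaptive_mask(layer, batches, torch.nn.functional.mse_loss)
print(mask.sum())  # 16 kept units out of 32
```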
- MLE-guided parameter search for task loss minimization in neural sequence modeling [83.83249536279239]
Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
arXiv Detail & Related papers (2020-06-04T22:21:22Z)
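The mixture described above translates directly into code. The following is a hedged sketch with illustrative names, sign conventions, and hyperparameters, not the authors' implementation: candidate updates are drawn from a mixture of random search around the current parameters and random search around the MLE descent step, then scored with the sequence-level task loss.

```python
import torch

def mgs_step(params, mle_step, task_loss_fn, k=4, sigma=0.01, mix=0.5):
    """Sketch of MLE-guided parameter search: draw k candidate updates from
    a mixture of (a) random search around the current parameters and
    (b) random search around the MLE descent step, then keep the candidate
    with the lowest sequence-level task loss."""
    candidates = []
    for _ in range(k):
        noise = sigma * torch.randn_like(params)
        if torch.rand(()) < mix:
            direction = noise                 # component (a): pure random search
        else:
            direction = -mle_step + noise     # component (b): guided by MLE step
        candidates.append(params + direction)
    losses = [task_loss_fn(c) for c in candidates]
    return candidates[min(range(k), key=lambda i: losses[i])]

# Toy usage: minimize a quadratic "task loss" with a noisy "MLE gradient".
params = torch.zeros(10)
task_loss = lambda p: (p - 1.0).pow(2).sum()
for _ in range(200):
    mle_grad = 2.0 * (params - 1.0) + 0.1 * torch.randn(10)  # stand-in gradient
    params = mgs_step(params, 0.05 * mle_grad, task_loss)
print(task_loss(params))  # should be close to 0
```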