Diversity Through Exclusion (DTE): Niche Identification for
Reinforcement Learning through Value-Decomposition
- URL: http://arxiv.org/abs/2302.01180v2
- Date: Fri, 3 Feb 2023 10:23:48 GMT
- Title: Diversity Through Exclusion (DTE): Niche Identification for
Reinforcement Learning through Value-Decomposition
- Authors: Peter Sunehag, Alexander Sasha Vezhnevets, Edgar Duéñez-Guzmán,
  Igor Mordatch, Joel Z. Leibo
- Abstract summary: We propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches.
We show that agents trained this way can escape poor-but-attractive local optima to instead converge to harder-to-discover higher value strategies.
- Score: 63.67574523750839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many environments contain numerous available niches of variable value, each
associated with a different local optimum in the space of behaviors (policy
space). In such situations it is often difficult to design a learning process
capable of evading distraction by poor local optima long enough to stumble upon
the best available niche. In this work we propose a generic reinforcement
learning (RL) algorithm that performs better than baseline deep Q-learning
algorithms in such environments with multiple variably-valued niches. The
algorithm we propose consists of two parts: an agent architecture and a
learning rule. The agent architecture contains multiple sub-policies. The
learning rule is inspired by fitness sharing in evolutionary computation and
applied in reinforcement learning using Value-Decomposition-Networks in a novel
manner for a single agent's internal population. It can concretely be
understood as adding an extra loss term where one policy's experience is also
used to update all the other policies in a manner that decreases their value
estimates for the visited states. In particular, when one sub-policy visits a
particular state frequently this decreases the value predicted for other
sub-policies for going to that state. Further, we introduce an artificial
chemistry inspired platform where it is easy to create tasks with multiple
rewarding strategies utilizing different resources (i.e. multiple niches). We
show that agents trained this way can escape poor-but-attractive local optima
to instead converge to harder-to-discover higher value strategies in both the
artificial chemistry environments and in simpler illustrative environments.
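To make the extra loss term concrete, here is a minimal sketch, not the authors' implementation: it assumes a deep Q-learning agent with several sub-policy heads, and the names `MultiHeadQ`, `dte_loss`, and `exclusion_weight` are hypothetical. It shows one plausible way the acting head's TD update could be paired with a penalty that lowers the other heads' value estimates for the visited state-action pairs.

```python
# Hedged sketch of a DTE-style exclusion loss (not the authors' code).
# Assumes one-hot states and a small multi-head Q-network; all names here
# (MultiHeadQ, dte_loss, exclusion_weight) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadQ(nn.Module):
    def __init__(self, n_states, n_actions, n_heads, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_heads)
        )

    def forward(self, state_onehot):
        # One Q-value tensor per sub-policy head.
        return [head(state_onehot) for head in self.heads]

def dte_loss(model, acting_head, s, a, r, s_next,
             gamma=0.99, exclusion_weight=0.1):
    q_all = model(s)
    q_next_all = model(s_next)

    # Standard TD loss for the sub-policy that generated the transition.
    q_sa = q_all[acting_head].gather(1, a)
    with torch.no_grad():
        target = r + gamma * q_next_all[acting_head].max(dim=1, keepdim=True).values
    td_loss = F.mse_loss(q_sa, target)

    # Exclusion term: the same experience pushes the *other* heads' value
    # estimates for the visited state-action pair downward, discouraging
    # them from crowding into the acting head's niche.
    exclusion = torch.zeros(())
    for k, q_k in enumerate(q_all):
        if k != acting_head:
            exclusion = exclusion + q_k.gather(1, a).mean()
    return td_loss + exclusion_weight * exclusion
```

In the paper the penalty arises from treating the agent's internal population with Value-Decomposition-Networks; the simple additive penalty above is only meant to convey the direction of the update, and the weighting and target construction are assumptions.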
Related papers
- Can Learned Optimization Make Reinforcement Learning Less Difficult? [70.5036361852812]
We consider whether learned optimization can help overcome reinforcement learning difficulties.
Our method, Learned Optimization for Plasticity, Exploration and Non-stationarity (OPEN), meta-learns an update rule whose input features and output structure are informed by solutions previously proposed for these difficulties.
arXiv Detail & Related papers (2024-07-09T17:55:23Z)
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
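As a rough illustration of how action quantization lets discrete-action offline RL methods operate on continuous-control data, here is a generic k-means codebook sketch. It is a stand-in under assumptions, not the adaptive scheme proposed in the paper above, and the names `build_action_codebook` and `quantize` are hypothetical.

```python
# Hedged sketch: discretize continuous actions into a learned codebook so that
# discrete-action offline RL losses apply. Generic k-means illustration only.
import numpy as np
from sklearn.cluster import KMeans

def build_action_codebook(actions, n_bins=32, seed=0):
    """Fit a codebook over the dataset's continuous actions."""
    km = KMeans(n_clusters=n_bins, random_state=seed, n_init=10).fit(actions)
    return km.cluster_centers_  # shape: (n_bins, action_dim)

def quantize(action, codebook):
    """Map a continuous action to the index of its nearest codebook entry."""
    return int(np.argmin(np.linalg.norm(codebook - action, axis=1)))

# Example: 1000 random 4-D actions discretized into 32 bins.
acts = np.random.uniform(-1.0, 1.0, size=(1000, 4))
codebook = build_action_codebook(acts)
idx = quantize(acts[0], codebook)      # discrete action index for training
reconstructed = codebook[idx]          # continuous action actually executed
```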
- Neural Inventory Control in Networks via Hindsight Differentiable Policy Optimization [5.590976834881065]
We argue that inventory management presents unique opportunities for reliably applying and evaluating deep reinforcement learning (DRL) algorithms.
The first of two proposed techniques is Hindsight Differentiable Policy Optimization (HDPO), which performs gradient descent to optimize policy performance.
The second technique involves aligning policy (neural) network structures with the structure of the inventory network.
arXiv Detail & Related papers (2023-06-20T02:58:25Z)
- Learning a subspace of policies for online adaptation in Reinforcement Learning [14.7945053644125]
In control systems, the robot on which a policy is learned might differ from the robot on which a policy will run.
There is a need to develop RL methods that generalize well to variations of the training conditions.
In this article, we consider the simplest yet hard to tackle generalization setting where the test environment is unknown at train time.
arXiv Detail & Related papers (2021-10-11T11:43:34Z)
- Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning [52.7873574425376]
Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents.
We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm.
arXiv Detail & Related papers (2021-09-22T10:08:15Z)
- Recomposing the Reinforcement Learning Building Blocks with Hypernetworks [19.523737925041278]
We show that a primary network determines the weights of a conditional dynamic network.
This approach improves the gradient approximation and reduces the learning step variance.
We demonstrate a consistent improvement across different locomotion tasks and different algorithms both in RL (TD3 and SAC) and in Meta-RL (MAML and PEARL).
arXiv Detail & Related papers (2021-06-12T19:43:12Z)
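The hypernetwork recomposition described in the entry above can be sketched as follows. This is a minimal illustrative module, not the paper's architecture; `HyperQ` and its layer sizes are assumptions. A primary network reads the state and emits the weights of a small dynamic network that maps the action to a value.

```python
# Hedged sketch of the hypernetwork idea (not the paper's architecture):
# a primary network produces the weights of a conditional dynamic network.
import torch
import torch.nn as nn

class HyperQ(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.action_dim, self.hidden = action_dim, hidden
        # Primary network: emits weights/biases for a one-hidden-layer
        # dynamic network of shape action_dim -> hidden -> 1.
        n_params = action_dim * hidden + hidden + hidden + 1
        self.primary = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_params))

    def forward(self, state, action):
        p = self.primary(state)                      # (B, n_params)
        B = state.shape[0]
        i = 0
        w1 = p[:, i:i + self.action_dim * self.hidden].view(B, self.hidden, self.action_dim)
        i += self.action_dim * self.hidden
        b1 = p[:, i:i + self.hidden]; i += self.hidden
        w2 = p[:, i:i + self.hidden].view(B, 1, self.hidden); i += self.hidden
        b2 = p[:, i:i + 1]
        # Dynamic network: weights above, applied to the action.
        h = torch.relu(torch.bmm(w1, action.unsqueeze(-1)).squeeze(-1) + b1)
        q = torch.bmm(w2, h.unsqueeze(-1)).squeeze(-1) + b2
        return q                                      # (B, 1) state-action value
```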
- Multi-agent navigation based on deep reinforcement learning and traditional pathfinding algorithm [0.0]
We develop a new framework for the multi-agent collision avoidance problem.
The framework combines a traditional pathfinding algorithm with reinforcement learning.
In our approach, the agents learn whether to be navigated or to take simple actions to avoid their partners.
arXiv Detail & Related papers (2020-12-05T08:56:58Z)
- Meta-Gradient Reinforcement Learning with an Objective Discovered Online [54.15180335046361]
We propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network.
Because the objective is discovered online, it can adapt to changes over time.
On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency.
arXiv Detail & Related papers (2020-07-16T16:17:09Z)
- Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization [100.72335252255989]
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments.
We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task.
arXiv Detail & Related papers (2020-05-06T16:14:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.