Reward Design in Cooperative Multi-agent Reinforcement Learning for
Packet Routing
- URL: http://arxiv.org/abs/2003.03433v1
- Date: Thu, 5 Mar 2020 02:27:46 GMT
- Title: Reward Design in Cooperative Multi-agent Reinforcement Learning for
Packet Routing
- Authors: Hangyu Mao, Zhibo Gong, and Zhen Xiao
- Abstract summary: We study the reward design problem in cooperative multi-agent reinforcement learning (MARL) based on packet routing environments.
We show that the commonly used global and local reward signals are prone to produce suboptimal policies.
We design mixed reward signals that can be used off-the-shelf to learn better policies.
- Score: 8.021402935358488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In cooperative multi-agent reinforcement learning (MARL), how to design a
suitable reward signal to accelerate learning and stabilize convergence is a
critical problem. The global reward signal assigns the same global reward to
all agents without distinguishing their contributions, while the local reward
signal provides different local rewards to each agent based solely on
individual behavior. Both reward assignment approaches have shortcomings: the
former might encourage lazy agents, while the latter might produce selfish
agents.
In this paper, we study the reward design problem in cooperative MARL based on
packet routing environments. First, we show that the above two reward signals
are prone to produce suboptimal policies. Then, guided by several observations
and considerations, we design mixed reward signals that can be used
off-the-shelf to learn better policies. Finally, we turn the mixed reward
signals into adaptive counterparts, which achieve the best results in our
experiments. Other reward signals are also discussed in this paper. As reward
design is a fundamental problem in RL, and especially in MARL, we hope that
MARL researchers will rethink the rewards used in their systems.
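To make the contrast concrete, below is a minimal Python sketch of the three reward-assignment schemes the abstract discusses. It is illustrative only: the sum-based global reward, the `alpha` mixing weight, and the `assign_rewards` helper are assumptions made for exposition, not the paper's actual packet-routing rewards or its adaptive mixing scheme.

```python
import numpy as np

def assign_rewards(local_rewards, alpha=0.5):
    """Return the global, local, and mixed reward signals for each agent.

    local_rewards: per-agent local rewards, e.g. the negative delay on each
        router's own links in a packet-routing environment (an assumption
        made for this sketch).
    alpha: hypothetical mixing weight between the local and global signals.
    """
    local_rewards = np.asarray(local_rewards, dtype=float)

    # Global signal: every agent receives the same team-level reward (here
    # assumed to be the sum of local rewards), so individual contributions
    # are not distinguished, which can encourage lazy agents.
    global_reward = local_rewards.sum()
    global_signal = np.full_like(local_rewards, global_reward)

    # Local signal: each agent is rewarded only for its own behavior,
    # which can produce selfish agents.
    local_signal = local_rewards.copy()

    # Mixed signal: blend the two so that each agent is sensitive to both
    # its own behavior and the team outcome.
    mixed_signal = alpha * local_signal + (1.0 - alpha) * global_signal
    return global_signal, local_signal, mixed_signal

# Example: three routers with different individual contributions.
g, l, m = assign_rewards([1.0, 0.2, -0.5], alpha=0.7)
print("global:", g, "local:", l, "mixed:", m)
```

In this framing, the adaptive counterparts mentioned in the abstract would roughly correspond to adjusting the mixing during training rather than fixing it in advance.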
Related papers
- Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)
- The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards [34.636688162807836]
Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents.
Our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic rewards.
We introduce BiMI, a novel reward function designed to mitigate noise.
arXiv Detail & Related papers (2024-09-24T09:45:20Z)
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, designing immediate reward signals is difficult.
We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose a hierarchical reward modeling framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL over SOTA baselines is demonstrated on benchmark multi-agent scenarios in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Automatic Reward Design via Learning Motivation-Consistent Intrinsic Rewards [46.068337522093096]
We introduce the concept of motivation, which captures the underlying goal of maximizing certain rewards.
Our method performs better than the state-of-the-art methods in handling problems of delayed reward, exploration, and credit assignment.
arXiv Detail & Related papers (2022-07-29T14:52:02Z)
- Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning [9.084006156825632]
This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal.
We introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL), which addresses credit assignment across both agents and time.
AREL results in higher rewards in Particle World and improved win rates in StarCraft compared to three state-of-the-art reward redistribution methods.
arXiv Detail & Related papers (2022-01-12T18:35:46Z)
- Distributional Reinforcement Learning for Multi-Dimensional Reward Functions [91.88969237680669]
We introduce Multi-Dimensional Distributional DQN (MD3QN) to model the joint return distribution from multiple reward sources.
As a by-product of joint distribution modeling, MD3QN can capture the randomness in returns for each source of reward.
In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions.
arXiv Detail & Related papers (2021-10-26T11:24:23Z)
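As an illustration of the last entry above (MD3QN), the sketch below builds an empirical joint return distribution over two reward sources by Monte Carlo rollouts of a random policy in a toy two-state chain. This is only the quantity MD3QN is described as modeling, not the MD3QN algorithm itself; the toy dynamics, the two reward sources, and the discount factor are assumptions made for the example.

```python
import numpy as np

def sample_joint_returns(n_episodes=5000, horizon=20, gamma=0.9, seed=0):
    """Monte Carlo estimate of the joint return distribution over two
    correlated reward sources in a toy two-state chain under a random policy."""
    rng = np.random.default_rng(seed)
    returns = np.zeros((n_episodes, 2))           # one column per reward source
    for ep in range(n_episodes):
        state, discount = 0, 1.0
        for _ in range(horizon):
            action = int(rng.integers(2))         # uniform random policy
            state = (state + action) % 2          # toy deterministic dynamics
            r1 = 1.0 if state == 1 else 0.0       # source 1: bonus for reaching state 1
            r2 = -0.1 * action + rng.normal(0.0, 0.05)  # source 2: noisy action cost
            returns[ep] += discount * np.array([r1, r2])
            discount *= gamma
    return returns

joint = sample_joint_returns()
print("mean return per source:", joint.mean(axis=0))
print("correlation between sources:", np.corrcoef(joint.T)[0, 1])
```

The per-source means and the cross-source correlation printed at the end are the kind of structure a joint return distribution captures and that independent per-source models would miss.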
This list is automatically generated from the titles and abstracts of the papers on this site.