STAS: Spatial-Temporal Return Decomposition for Multi-agent
Reinforcement Learning
- URL: http://arxiv.org/abs/2304.07520v2
- Date: Thu, 4 Jan 2024 13:18:00 GMT
- Title: STAS: Spatial-Temporal Return Decomposition for Multi-agent
Reinforcement Learning
- Authors: Sirui Chen, Zhaowei Zhang, Yaodong Yang, Yali Du
- Abstract summary: We introduce a novel method that learns credit assignment in both temporal and spatial dimensions.
Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines.
- Score: 10.102447181869005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Centralized Training with Decentralized Execution (CTDE) has been proven to
be an effective paradigm in cooperative multi-agent reinforcement learning
(MARL). One of the major challenges is credit assignment, which aims to credit
agents by their contributions. While prior studies have shown great success,
their methods typically fail to work in episodic reinforcement learning
scenarios where global rewards are revealed only at the end of the episode.
They lack the functionality to model complicated relations of the delayed
global reward in the temporal dimension and suffer from inefficiencies. To
tackle this, we introduce Spatial-Temporal Attention with Shapley (STAS), a
novel method that learns credit assignment in both temporal and spatial
dimensions. It first decomposes the global return back to each time step, then
utilizes the Shapley Value to redistribute the individual payoff from the
decomposed global reward. To mitigate the computational complexity of the
Shapley Value, we introduce an approximation of marginal contribution and
utilize Monte Carlo sampling to estimate it. We evaluate our method on an Alice
& Bob example and MPE environments across different scenarios. Our results
demonstrate that our method effectively assigns spatial-temporal credit,
outperforming all state-of-the-art baselines.
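The abstract's Shapley step can be illustrated with a generic permutation-sampling estimator. This is a minimal sketch, not the authors' implementation: STAS uses its own approximation of marginal contribution, which the abstract does not detail. The names `monte_carlo_shapley`, `coalition_value`, and `toy_reward` are hypothetical; `coalition_value` stands in for whatever function scores a subset of agents at one time step after the global return has been decomposed.

```python
import random
from typing import Callable, FrozenSet, List


def monte_carlo_shapley(
    n_agents: int,
    coalition_value: Callable[[FrozenSet[int]], float],
    n_samples: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Estimate each agent's Shapley value by sampling random agent orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_agents
    agents = list(range(n_agents))
    for _ in range(n_samples):
        rng.shuffle(agents)  # one random ordering of the agents
        coalition: FrozenSet[int] = frozenset()
        prev_value = coalition_value(coalition)
        for agent in agents:
            # Marginal contribution of `agent` when joining the agents before it.
            coalition = coalition | {agent}
            value = coalition_value(coalition)
            totals[agent] += value - prev_value
            prev_value = value
    return [total / n_samples for total in totals]


# Hypothetical per-step reward (an assumption for illustration only):
# agents 0 and 1 earn 1.0 only when both are present; agent 2 adds 0.5 alone.
def toy_reward(coalition: FrozenSet[int]) -> float:
    return (1.0 if {0, 1} <= coalition else 0.0) + (0.5 if 2 in coalition else 0.0)


credits = monte_carlo_shapley(3, toy_reward, n_samples=2000)
```

Because the marginal contributions telescope within each sampled ordering, the estimated credits always sum to the value of the full coalition minus the value of the empty one, so the decomposed per-step reward is fully redistributed among the agents. Sampling orderings avoids enumerating all coalitions, whose number grows exponentially with the number of agents.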
Related papers
- Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models [69.51130760097818]
We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function.
We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks.
arXiv Detail & Related papers (2023-11-15T04:40:43Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
An $\epsilon$-optimal policy is found after exploring $\tilde{O}(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z)
- Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning [52.7873574425376]
Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents.
We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm.
arXiv Detail & Related papers (2021-09-22T10:08:15Z)
- Multimodal Reward Shaping for Efficient Exploration in Reinforcement Learning [8.810296389358134]
Intrinsic reward shaping (IRS) modules rely on attendant models or additional memory to record and analyze learning procedures.
We introduce a novel metric entitled Jain's fairness index (JFI) to replace the entropy regularizer.
arXiv Detail & Related papers (2021-07-19T14:04:32Z)
- Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning [34.856522993714535]
We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment which accounts for the coalition of agents.
Our method outperforms existing cooperative MARL algorithms significantly and achieves the state-of-the-art, with especially large margins on tasks with more severe difficulties.
arXiv Detail & Related papers (2021-06-01T07:38:34Z)