RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement
Learning Agents
- URL: http://arxiv.org/abs/2102.08159v2
- Date: Wed, 17 Feb 2021 02:02:32 GMT
- Title: RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement
Learning Agents
- Authors: Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An,
Svetlana Obraztsova, Zinovi Rabinovich
- Abstract summary: We propose a novel MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values.
We show that our method significantly outperforms state-of-the-art methods on challenging StarCraft II tasks.
- Score: 40.51184157538392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current value-based multi-agent reinforcement learning methods optimize
individual Q values to guide individuals' behaviours via centralized training
with decentralized execution (CTDE). However, such an expected, i.e.,
risk-neutral, Q value is not sufficient even with CTDE due to the randomness of
rewards and the uncertainty of environments, which causes these methods to fail
to train coordinating agents in complex environments. To address these
issues, we propose RMIX, a novel cooperative MARL method with the Conditional
Value at Risk (CVaR) measure over the learned distributions of individuals' Q
values. Specifically, we first learn the return distributions of individuals to
analytically calculate CVaR for decentralized execution. Then, to handle the
temporal nature of the stochastic outcomes during executions, we propose a
dynamic risk level predictor for risk level tuning. Finally, we optimize the
CVaR policies: during centralized training, the CVaR values are used to
estimate the target in the TD error, and they also serve as auxiliary local
rewards to update the local return distributions via a quantile regression
loss. Empirically, we
show that our method significantly outperforms state-of-the-art methods on
challenging StarCraft II tasks, demonstrating enhanced coordination and
improved sample efficiency.
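As a concrete illustration of the first step, the sketch below shows how a per-action CVaR can be computed analytically from a learned quantile representation of the return distribution and then used for risk-sensitive action selection. This is a minimal sketch, not the authors' implementation: the uniform-quantile representation, the function names, and the fixed risk level alpha are assumptions made for illustration only.

```python
import numpy as np

def cvar_from_quantiles(quantiles, alpha):
    """Estimate CVaR_alpha of a return distribution given N quantile
    estimates at uniform fractions (as in QR-DQN-style distributional RL).

    CVaR_alpha is the expected return over the worst alpha-fraction of
    outcomes, so we average the lowest ceil(alpha * N) quantile values.
    """
    q = np.sort(np.asarray(quantiles, dtype=float))
    k = max(1, int(np.ceil(alpha * len(q))))
    return q[:k].mean()

def greedy_risk_sensitive_action(per_action_quantiles, alpha):
    """Pick the action whose CVaR (rather than mean) is largest.

    per_action_quantiles: array of shape (num_actions, num_quantiles).
    """
    cvars = [cvar_from_quantiles(qs, alpha) for qs in per_action_quantiles]
    return int(np.argmax(cvars))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy example with two actions: the first has a higher mean return
    # but a heavier left tail (riskier); the second is safer.
    quantiles = np.stack([
        np.sort(rng.normal(1.0, 2.0, size=32)),   # risky action
        np.sort(rng.normal(0.8, 0.5, size=32)),   # safer action
    ])
    print("risk-neutral choice:", int(np.argmax(quantiles.mean(axis=1))))
    print("CVaR(0.25) choice:  ", greedy_risk_sensitive_action(quantiles, 0.25))
```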
Related papers
- Provably Efficient Iterated CVaR Reinforcement Learning with Function
Approximation and Human Feedback [57.6775169085215]
Risk-sensitive reinforcement learning aims to optimize policies that balance the expected reward and risk.
We present a novel framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations.
We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis.
arXiv Detail & Related papers (2023-07-06T08:14:54Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based
Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z) - Risk-Aware Distributed Multi-Agent Reinforcement Learning [8.287693091673658]
We develop a distributed MARL approach to solve decision-making problems in unknown environments by learning risk-aware actions.
We then propose a distributed MARL algorithm called the CVaR QD-Learning algorithm, and establish that the value functions of individual agents reach consensus.
arXiv Detail & Related papers (2023-04-04T17:56:44Z) - Taming Multi-Agent Reinforcement Learning with Estimator Variance
Reduction [12.94372063457462]
Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms.
It suffers from a critical drawback due to its reliance on learning from a single sample of the joint-action at a given state.
We propose an enhancement tool that accommodates any actor-critic MARL method.
arXiv Detail & Related papers (2022-09-02T13:44:00Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Risk-Aware Learning for Scalable Voltage Optimization in Distribution
Grids [19.0428894025206]
This paper aims to improve learning-enabled approaches by accounting for the potential risks associated with reactive power prediction and voltage deviation.
Specifically, we advocate measuring such risks using the conditional value-at-risk (CVaR) loss, which is computed from the worst-case samples only.
Because such worst-case samples are scarce in each mini-batch, we propose to accelerate the training process under the CVaR loss objective by selecting the mini-batches that are more likely to contain the worst-case samples of interest.
arXiv Detail & Related papers (2021-10-04T15:00:13Z) - Learning Calibrated Uncertainties for Domain Shift: A Distributionally
Robust Learning Approach [150.8920602230832]
We propose a framework for learning calibrated uncertainties under domain shifts.
In particular, the density ratio estimation reflects the closeness of a target (test) sample to the source (training) distribution.
We show that our proposed method generates calibrated uncertainties that benefit downstream tasks.
arXiv Detail & Related papers (2020-10-08T02:10:54Z) - Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z) - Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for
Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)