Unbiased Asymmetric Actor-Critic for Partially Observable Reinforcement Learning
- URL: http://arxiv.org/abs/2105.11674v1
- Date: Tue, 25 May 2021 05:18:44 GMT
- Title: Unbiased Asymmetric Actor-Critic for Partially Observable Reinforcement Learning
- Authors: Andrea Baisero and Christopher Amato
- Abstract summary: Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic.
We examine the theory of asymmetric actor-critic methods which use state-based critics, and expose fundamental issues which undermine the validity of a common variant.
We propose an unbiased asymmetric actor-critic variant which is able to exploit state information while remaining theoretically sound.
- Score: 17.48572546628464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In partially observable reinforcement learning, offline training gives access
to latent information which is not available during online training and/or
execution, such as the system state. Asymmetric actor-critic methods exploit
such information by training a history-based policy via a state-based critic.
However, many asymmetric methods lack theoretical foundation, and are only
evaluated on limited domains. We examine the theory of asymmetric actor-critic
methods which use state-based critics, and expose fundamental issues which
undermine the validity of a common variant, and its ability to address high
partial observability. We propose an unbiased asymmetric actor-critic variant
which is able to exploit state information while remaining theoretically sound,
maintaining the validity of the policy gradient theorem, and introducing no
bias and relatively low variance into the training process. An empirical
evaluation performed on domains which exhibit significant partial observability
confirms our analysis, and shows the unbiased asymmetric actor-critic converges
to better policies and/or faster than symmetric actor-critic and standard
asymmetric actor-critic baselines.
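To make the comparison concrete, the sketch below contrasts the three critics in a minimal PyTorch-style advantage actor-critic setup. It is an illustration under assumed names and dimensions (the toy batch, the network sizes, and identifiers such as critic_hs are not from the paper), not the authors' implementation. The symmetric critic conditions on the encoded history only, the standard asymmetric critic on the latent state only, and the unbiased asymmetric variant on both history and state, the idea being that a history-state value whose conditional expectation given the history recovers the history value can exploit state information without invalidating the policy gradient.
```python
# Minimal sketch (not the authors' code): three critic choices for a
# history-based actor under partial observability. All sizes and the toy
# batch below are illustrative assumptions.
import torch
import torch.nn as nn

GAMMA = 0.99
HIST_DIM, STATE_DIM, N_ACTIONS, BATCH = 16, 6, 3, 32

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

policy = mlp(HIST_DIM, N_ACTIONS)            # history-based actor pi(a | h)
critic_sym = mlp(HIST_DIM, 1)                # symmetric critic V(h)
critic_asym = mlp(STATE_DIM, 1)              # standard asymmetric critic V(s)
critic_hs = mlp(HIST_DIM + STATE_DIM, 1)     # history-state critic U(h, s)

def a2c_losses(critic, feats, batch):
    """One-step advantage actor-critic losses; `feats` picks the critic input."""
    h, s, a, r, h2, s2, done = batch
    v = critic(feats(h, s)).squeeze(-1)
    with torch.no_grad():
        v2 = critic(feats(h2, s2)).squeeze(-1)
        target = r + GAMMA * (1.0 - done) * v2       # bootstrapped TD target
    adv = target - v.detach()
    logp = torch.log_softmax(policy(h), dim=-1)
    logp_a = logp.gather(-1, a.unsqueeze(-1)).squeeze(-1)
    return -(adv * logp_a).mean(), (target - v).pow(2).mean()

# Toy batch standing in for (history, state, action, reward, next history/state, done).
batch = (torch.randn(BATCH, HIST_DIM), torch.randn(BATCH, STATE_DIM),
         torch.randint(N_ACTIONS, (BATCH,)), torch.randn(BATCH),
         torch.randn(BATCH, HIST_DIM), torch.randn(BATCH, STATE_DIM),
         torch.zeros(BATCH))

sym = a2c_losses(critic_sym, lambda h, s: h, batch)        # V(h): no state info
asym = a2c_losses(critic_asym, lambda h, s: s, batch)      # V(s): uses state, biased in general
unbiased = a2c_losses(critic_hs, lambda h, s: torch.cat([h, s], -1), batch)  # U(h, s)
```
In this sketch, any bias from the state-only critic enters through the bootstrapped target and advantage, which ignore the history the policy actually conditions on.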
Related papers
- On Centralized Critics in Multi-Agent Reinforcement Learning [16.361249170514828]
Centralized Training for Decentralized Execution has become a popular approach in Multi-Agent Reinforcement Learning.
We analyze the effect of using state-based critics in partially observable environments.
arXiv Detail & Related papers (2024-08-26T19:27:06Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly [79.07074710460012]
The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention.
An increasing number of transfer-based methods have been developed to fool black-box DNN models.
We establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods.
arXiv Detail & Related papers (2023-11-02T15:35:58Z)
- Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation [13.325600043256552]
We propose a semi-offline evaluation framework, where human users provide annotations of unobserved counterfactual trajectories.
Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of reinforcement learning in high-stakes domains.
arXiv Detail & Related papers (2023-10-26T04:41:19Z)
- A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning [17.36759906285316]
We show that state-based critics can introduce bias in the policy estimates, potentially undermining the guarantees of the algorithm.
We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition.
arXiv Detail & Related papers (2022-01-03T14:51:30Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale, publicly available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z)
- A Symmetric Loss Perspective of Reliable Machine Learning [87.68601212686086]
We review how a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization.
We demonstrate how the robust AUC method can benefit natural language processing in the problem where we want to learn only from relevant keywords.
arXiv Detail & Related papers (2021-01-05T06:25:47Z)
- Learning Value Functions in Deep Policy Gradients using Residual Variance [22.414430270991005]
Policy gradient algorithms have proven to be successful in diverse decision making and control tasks.
Traditional actor-critic algorithms do not succeed in fitting the true value function.
We provide a new state-value (resp. state-action-value) function approximation that learns the value of the states relative to their mean value; a sketch of such an objective appears after this list.
arXiv Detail & Related papers (2020-10-09T08:57:06Z)
- How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
- Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
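The residual-variance entry above describes learning state values relative to their mean value; the sketch below is a rough illustration under assumed names, not that paper's code, of one way such an objective can be written: the usual mean-squared critic loss is replaced by the variance of the return residual, which pins down the value function only up to a constant shift.
```python
# Hedged sketch: a residual-variance critic objective versus the usual MSE.
# Function names and the toy data are illustrative assumptions.
import torch

def mse_critic_loss(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Standard regression of predicted values onto sampled returns.
    return (returns - values).pow(2).mean()

def residual_variance_critic_loss(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Variance of the residual: invariant to a constant offset in the values,
    # so it fits the values only "relative to their mean".
    residual = returns - values
    return residual.var(unbiased=False)

values = torch.randn(64, requires_grad=True)   # toy critic outputs
returns = torch.randn(64)                      # toy sampled returns
residual_variance_critic_loss(values, returns).backward()
```
Because the variance objective is invariant to a constant offset in the values, any leftover offset can be absorbed when the critic is used as a baseline.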