The Advantage Regret-Matching Actor-Critic
- URL: http://arxiv.org/abs/2008.12234v1
- Date: Thu, 27 Aug 2020 16:30:17 GMT
- Title: The Advantage Regret-Matching Actor-Critic
- Authors: Audrūnas Gruslys, Marc Lanctot, Rémi Munos, Finbarr Timbers,
Martin Schmid, Julien Perolat, Dustin Morrill, Vinicius Zambaldi,
Jean-Baptiste Lespiau, John Schultz, Mohammad Gheshlaghi Azar, Michael
Bowling, and Karl Tuyls
- Abstract summary: We propose a model-free reinforcement learning algorithm for no-regret learning.
We use retrospective value estimates to predict conditional advantages which, combined with regret matching, produce a new policy.
In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact.
- Score: 31.475994100183794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regret minimization has played a key role in online learning, equilibrium
computation in games, and reinforcement learning (RL). In this paper, we
describe a general model-free RL method for no-regret learning based on
repeated reconsideration of past behavior. We propose a model-free RL
algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than
saving past state-action data, ARMAC saves a buffer of past policies, replaying
through them to reconstruct hindsight assessments of past behavior. These
retrospective value estimates are used to predict conditional advantages which,
combined with regret matching, produce a new policy. In particular, ARMAC
learns from sampled trajectories in a centralized training setting, without
requiring the application of importance sampling commonly used in Monte Carlo
counterfactual regret (CFR) minimization; hence, it does not suffer from
excessive variance in large environments. In the single-agent setting, ARMAC
shows an interesting form of exploration by keeping past policies intact. In
the multiagent setting, ARMAC in self-play approaches Nash equilibria on some
partially-observable zero-sum benchmarks. We provide exploitability estimates
in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
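As a rough illustration of the regret-matching step described above, here is a minimal sketch that turns predicted conditional advantages at a single information state into a policy; the advantage values are made up, and ARMAC's retrospective critics, buffer of past policies, and replay machinery are omitted.

```python
import numpy as np

def regret_matching_policy(advantages, eps=1e-12):
    """Map predicted conditional advantages for one information state to a
    policy: probabilities proportional to the positive part of the advantages,
    falling back to uniform when no advantage is positive."""
    positive = np.maximum(advantages, 0.0)
    total = positive.sum()
    if total <= eps:
        return np.full_like(advantages, 1.0 / len(advantages))
    return positive / total

# Illustrative advantages for a 4-action information state (made-up numbers).
adv = np.array([0.3, -0.1, 0.05, -0.4])
print(regret_matching_policy(adv))  # ~[0.857, 0.0, 0.143, 0.0]
```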
Related papers
- May the Forgetting Be with You: Alternate Replay for Learning with Noisy Labels [16.262555459431155]
We introduce Alternate Experience Replay (AER), which takes advantage of forgetting to maintain a clear distinction between clean, complex, and noisy samples in the memory buffer.
We demonstrate the effectiveness of our approach in terms of both accuracy and purity of the obtained buffer, resulting in a remarkable average gain of 4.71% points in accuracy with respect to existing loss-based purification strategies.
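As a loose, self-contained illustration of alternating learning with a purification pass over the buffer, the toy below flags high-loss buffer samples as presumed noisy on alternate epochs; the model, data, threshold, and schedule are all invented for the example and are not AER's actual criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy buffer: features, labels with 20% synthetic flips, logistic model.
buffer_x = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
buffer_y = (buffer_x @ true_w > 0).astype(float)
noisy = rng.random(200) < 0.2
buffer_y[noisy] = 1.0 - buffer_y[noisy]

w = np.zeros(5)
for epoch in range(6):
    # Standard replay epoch: SGD over the whole buffer.
    for i in rng.permutation(len(buffer_x)):
        p = 1.0 / (1.0 + np.exp(-buffer_x[i] @ w))
        w += 0.1 * (buffer_y[i] - p) * buffer_x[i]
    if epoch % 2 == 1:
        # Alternate "purification" pass: samples the current model fits worst
        # are flagged as presumed noisy (an invented stand-in for AER's rule).
        p = 1.0 / (1.0 + np.exp(-(buffer_x @ w)))
        loss = -(buffer_y * np.log(p + 1e-9) + (1 - buffer_y) * np.log(1 - p + 1e-9))
        flagged = loss > np.quantile(loss, 0.8)
        print(f"epoch {epoch}: flagged {flagged.sum()}, truly noisy among them: {(flagged & noisy).sum()}")
```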
arXiv Detail & Related papers (2024-08-26T14:09:40Z)
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
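For context, a generic pessimism-style update in the spirit of conservative offline Q-learning is sketched below; how SCQ actually separates easy from hard-to-estimate OOD actions is not reproduced, and the penalty form, tabular setting, and numbers are assumptions made for illustration.

```python
import numpy as np

def conservative_q_update(q, s, a, r, s_next, dataset_actions,
                          alpha=0.5, gamma=0.99, lr=0.1):
    """One tabular Q-learning step plus a pessimism penalty that pushes down
    value estimates of actions unseen in the offline data for this state
    (a generic stand-in for the OOD handling that SCQ refines)."""
    target = r + gamma * np.max(q[s_next])
    q[s, a] += lr * (target - q[s, a])
    for b in range(q.shape[1]):
        if b not in dataset_actions[s]:
            q[s, b] -= lr * alpha  # penalize out-of-distribution actions
    return q

q = np.zeros((3, 2))
dataset_actions = {0: {0}, 1: {0, 1}, 2: {1}}
q = conservative_q_update(q, s=0, a=0, r=1.0, s_next=1, dataset_actions=dataset_actions)
print(q)
```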
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- When Learning Is Out of Reach, Reset: Generalization in Autonomous Visuomotor Reinforcement Learning [10.469509984098705]
Episodic training, where an agent's environment is reset after every success or failure, is the de facto standard when training embodied reinforcement learning (RL) agents.
In this work, we look to minimize, rather than completely eliminate, resets while building visual agents that can meaningfully generalize.
Our proposed approach significantly outperforms prior episodic, reset-free, and reset-minimizing approaches, achieving higher success rates.
arXiv Detail & Related papers (2023-03-30T17:59:26Z)
- Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
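A minimal sketch of the general recipe the summary describes, i.e. imputing unobserved rewards with a ridge regression fit on sketched (randomly projected) contexts; the sketch matrix, dimensions, and data here are illustrative assumptions rather than SPUIR's actual construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 32, 8, 500                        # context dim, sketch dim, batch size

# Synthetic batch: only the executed action's reward is observed per context.
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
r_observed = X @ theta_true + 0.1 * rng.normal(size=n)

# Sketching: compress contexts with a random projection before the ridge fit.
S = rng.normal(size=(d, k)) / np.sqrt(k)
Z = X @ S
theta_sketch = np.linalg.solve(Z.T @ Z + 1.0 * np.eye(k), Z.T @ r_observed)

# Impute the rewards that non-executed actions (represented here by fresh
# contexts of the same form) would have received.
X_unobserved = rng.normal(size=(5, d))
print((X_unobserved @ S) @ theta_sketch)
```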
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
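The sketch below shows the generic pattern the summary points at: sampling rare, high-impact transitions more often and correcting with importance weights inside a tabular Q-learning loop. The buffer, impact proxy, and weighting are invented for the example and are not ImRE's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1

# Toy replay buffer where failures are rare but very costly.
buffer = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
           -10.0 if rng.random() < 0.05 else 0.1,
           int(rng.integers(n_states))) for _ in range(1000)]

# Sample transitions proportionally to a crude "impact" proxy, then reweight
# each update by uniform_prob / sample_prob so the expectation is unchanged.
impact = np.array([abs(r) for (_, _, r, _) in buffer]) + 0.05
probs = impact / impact.sum()
uniform = 1.0 / len(buffer)

q = np.zeros((n_states, n_actions))
for _ in range(5000):
    i = rng.choice(len(buffer), p=probs)
    s, a, r, s2 = buffer[i]
    w = uniform / probs[i]                 # importance weight
    q[s, a] += lr * w * (r + gamma * q[s2].max() - q[s, a])
print(np.round(q, 3))
```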
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
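A tiny sketch of the advantage computation described above: the positive (observed) item's Q-value minus the average Q-value of sampled negative items. The Q-values and item indices are made up, and the sequential models and full training loop are omitted.

```python
import numpy as np

def sa2c_advantage(q_values, positive_item, negative_items):
    """Advantage of the observed (positive) item over the average of sampled
    negatives, used as a weight on the supervised (actor) loss."""
    baseline = np.mean([q_values[j] for j in negative_items])
    return q_values[positive_item] - baseline

q_values = np.array([0.2, 1.1, 0.4, 0.9, 0.1])    # illustrative item Q-values
print(sa2c_advantage(q_values, positive_item=1, negative_items=[0, 2, 4]))
# 1.1 - mean(0.2, 0.4, 0.1) ≈ 0.867
```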
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
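For reference, the generic behavior-regularized actor objective that this line of work builds on is sketched below for one state with discrete actions: expected Q-value under the learned policy minus a divergence penalty toward the estimated behavior policy. The KL form, coefficient, and numbers are illustrative assumptions, not BRAC+'s specific improvements.

```python
import numpy as np

def behavior_regularized_objective(q_sa, pi, behavior, beta=1.0):
    """Per-state actor objective: expected Q under the learned policy minus a
    KL penalty toward the (estimated) behavior policy."""
    expected_q = np.dot(pi, q_sa)
    kl = np.sum(pi * np.log(pi / behavior))
    return expected_q - beta * kl

q_sa = np.array([1.0, 0.2, -0.5])        # critic estimates (illustrative)
behavior = np.array([0.6, 0.3, 0.1])     # estimated data-collection policy
pi = np.array([0.7, 0.2, 0.1])           # candidate learned policy
print(behavior_regularized_objective(q_sa, pi, behavior))
```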
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games [114.90723492840499]
In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium.
We formalize an online learning setting in which the strategy space is not known to the agent.
We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment.
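For reference, the quantity being bounded is the usual external regret against the best fixed strategy in hindsight; this is the standard definition, not the paper's exact notation:

```latex
R_T \;=\; \max_{x^\star \in \mathcal{X}} \sum_{t=1}^{T} u_t(x^\star) \;-\; \sum_{t=1}^{T} u_t(x_t),
\qquad R_T = O\!\left(T^{3/4}\right) \text{ with high probability.}
```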
arXiv Detail & Related papers (2021-03-08T04:03:24Z)
- Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning [17.3794999533024]
We show that deep RL appears to struggle in the presence of extraneous data.
Recent works have shown that the performance of Deep Q-Network (DQN) degrades when its replay memory becomes too large.
We re-examine the motivation for sampling uniformly over a replay memory, and find that it may be flawed when using function approximation.
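A rough sketch of one way to correct the multiplicity bias the summary describes: sample a state-action stratum uniformly first, then a stored transition within it, so over-duplicated experiences stop dominating the batch. The strata and data are invented; SER's precise stratification is in the paper.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)

# Toy replay memory in which one (state, action) pair is stored far more
# often than the others, so uniform sampling would over-represent it.
transitions = [(0, 0)] * 500 + [(1, 0)] * 30 + [(1, 1)] * 5
strata = defaultdict(list)
for idx, (s, a) in enumerate(transitions):
    strata[(s, a)].append(idx)

def stratified_sample(batch_size):
    """Pick a stratum uniformly, then a transition index inside it."""
    keys = list(strata)
    return [int(rng.choice(strata[keys[rng.integers(len(keys))]]))
            for _ in range(batch_size)]

print(stratified_sample(8))
```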
arXiv Detail & Related papers (2021-02-22T19:29:18Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy-constraint that limits divergence from the behavior policy, and a value-constraint that discourages overly optimistic estimates.
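A schematic per-state version of the two penalties mentioned above, with a KL policy-constraint toward the behavior policy and a value-constraint that penalizes optimism relative to a more pessimistic critic; the concrete penalty forms and coefficients are assumptions for illustration, not the paper's exact losses.

```python
import numpy as np

def doubly_constrained_actor_loss(q_sa, q_pessimistic, pi, behavior,
                                  lam_policy=1.0, lam_value=1.0):
    """Negative per-state objective: expected Q minus a policy-constraint
    (KL to the behavior policy) and a value-constraint (excess of the
    optimistic value estimate over a pessimistic one)."""
    expected_q = np.dot(pi, q_sa)
    policy_penalty = np.sum(pi * np.log(pi / behavior))
    value_penalty = max(expected_q - np.dot(pi, q_pessimistic), 0.0)
    return -(expected_q - lam_policy * policy_penalty - lam_value * value_penalty)

pi = np.array([0.5, 0.4, 0.1])              # candidate learned policy
behavior = np.array([0.4, 0.4, 0.2])        # estimated behavior policy
q_sa = np.array([1.2, 0.8, 0.1])            # possibly optimistic estimates
q_pessimistic = np.array([0.9, 0.7, 0.0])   # lower-confidence estimates
print(doubly_constrained_actor_loss(q_sa, q_pessimistic, pi, behavior))
```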
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.