Towards Practical Credit Assignment for Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2106.04499v1
- Date: Tue, 8 Jun 2021 16:35:05 GMT
- Title: Towards Practical Credit Assignment for Deep Reinforcement Learning
- Authors: Vyacheslav Alipov, Riley Simmons-Edler, Nikita Putintsev, Pavel
Kalinin, Dmitry Vetrov
- Abstract summary: Credit assignment is a fundamental problem in reinforcement learning.
Recently, a family of methods called Hindsight Credit Assignment (HCA) was proposed, which explicitly assign credit to actions in hindsight.
We present a new algorithm, Credit-Constrained Advantage Actor-Critic (C2A2C), which ignores policy updates for actions that, based on their credit in hindsight, do not affect future outcomes.
- Score: 0.6749750044497732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Credit assignment is a fundamental problem in reinforcement learning, the
problem of measuring an action's influence on future rewards. Improvements in
credit assignment methods have the potential to boost the performance of RL
algorithms on many tasks, but thus far have not seen widespread adoption.
Recently, a family of methods called Hindsight Credit Assignment (HCA) was
proposed, which explicitly assign credit to actions in hindsight based on the
probability of the action having led to an observed outcome. This approach is
appealing as a means to more efficient data usage, but remains a largely
theoretical idea applicable to a limited set of tabular RL tasks, and it is
unclear how to extend HCA to Deep RL environments. In this work, we explore the
use of HCA-style credit in a deep RL context. We first describe the limitations
of existing HCA algorithms in deep RL, then propose several
theoretically-justified modifications to overcome them. Based on this
exploration, we present a new algorithm, Credit-Constrained Advantage
Actor-Critic (C2A2C), which ignores policy updates for actions that, based on
their credit in hindsight, do not affect future outcomes, while updating the policy
as normal for those that do. We find that C2A2C outperforms Advantage
Actor-Critic (A2C) on the Arcade Learning Environment (ALE) benchmark, showing
broad improvements over A2C and motivating further work on credit-constrained
update rules for deep RL methods.
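For concreteness, the return-conditional form of HCA credit (Harutyunyan et al., 2019) can be written as A(x, a) = E[(1 - pi(a|x) / h(a|x, Z)) Z], where h is a learned hindsight distribution over actions given the observed return Z. The sketch below shows one plausible way a credit-constrained update in the spirit of C2A2C could use such a credit estimate as a gate on the policy-gradient term; it is an illustrative reconstruction under stated assumptions, not the authors' implementation, and names such as credit_constrained_a2c_loss, h_logp, and credit_threshold are introduced here for illustration.

    # Illustrative sketch of a credit-constrained A2C-style loss (not the paper's code).
    # Assumed per-timestep inputs:
    #   pi_logp    - log pi(a_t | x_t) under the current policy
    #   h_logp     - log h(a_t | x_t, Z_t) under a learned hindsight model
    #   advantage  - standard A2C advantage estimate A(x_t, a_t)
    #   value, ret - critic prediction and empirical return
    import torch
    import torch.nn.functional as F

    def credit_constrained_a2c_loss(pi_logp, h_logp, advantage, value, ret,
                                    credit_threshold=0.05, value_coef=0.5):
        # HCA-style hindsight credit: 1 - pi(a|x) / h(a|x, Z).
        # Values near zero indicate the action had little influence on the outcome.
        credit = 1.0 - torch.exp(pi_logp - h_logp)

        # Gate: keep the policy-gradient term only for actions with non-negligible credit.
        mask = (credit.abs() > credit_threshold).float()

        # Standard A2C policy-gradient term, applied only to credited actions.
        policy_loss = -(mask * pi_logp * advantage.detach()).mean()

        # The critic is trained as usual on all transitions.
        value_loss = F.mse_loss(value, ret)

        return policy_loss + value_coef * value_loss

Only the policy-gradient term is gated; the value loss (and, in practice, any entropy bonus) is computed exactly as in standard A2C.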
Related papers
- Exploiting Estimation Bias in Clipped Double Q-Learning for Continuous Control Reinforcement Learning Tasks [5.968716050740402]
This paper focuses on addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks.
We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent.
Most state-of-the-art deep RL algorithms can be equipped with the BE mechanism without hindering performance or increasing computational complexity.
arXiv Detail & Related papers (2024-02-14T10:44:03Z)
- A Survey of Temporal Credit Assignment in Deep Reinforcement Learning [47.17998784925718]
The Credit Assignment Problem (CAP) refers to the longstanding challenge Reinforcement Learning (RL) agents face in associating actions with their long-term consequences.
We propose a unifying formalism for credit that enables equitable comparisons of state-of-the-art algorithms.
We discuss the challenges posed by delayed effects, transpositions, and a lack of action influence, and analyse how existing methods aim to address them.
arXiv Detail & Related papers (2023-12-02T08:49:51Z)
- Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis [50.926791529605396]
We introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms.
Our algorithms achieve precise credit assignment by measuring the contribution of actions toward obtaining subsequent rewards.
arXiv Detail & Related papers (2023-06-29T09:27:27Z)
- Decoupled Prioritized Resampling for Offline RL [120.49021589395005]
We propose Offline Prioritized Experience Replay (OPER) for offline reinforcement learning.
OPER features a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training.
We show that this class of priority functions induces an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution.
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
- ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs [31.663072540757643]
Reinforcement Learning (RL) has been applied to real-world problems with increasing success.
We introduce Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), which targets last-iterate convergence in constrained MDPs.
arXiv Detail & Related papers (2023-02-02T18:05:27Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time (a minimal sketch of the idea appears after this list).
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- When does return-conditioned supervised learning work for offline reinforcement learning? [51.899892382786526]
We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
arXiv Detail & Related papers (2022-06-02T15:05:42Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z)
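The ReD entry above notes that return-based data rebalancing can be implemented with a handful of changed lines. A minimal sketch of that idea, assuming the offline dataset is stored as trajectories with known total returns, might look like the following; the softmax weighting and temperature are illustrative choices rather than the paper's exact recipe.

    # Hypothetical sketch of return-based data rebalancing for an offline RL dataset.
    # Assumes `trajectories` is a list of (transitions, total_return) pairs.
    import numpy as np

    def rebalanced_sampler(trajectories, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        returns = np.array([ret for _, ret in trajectories], dtype=np.float64)

        # Higher-return trajectories receive higher sampling weight; the softmax
        # keeps every probability strictly positive, so the support of the data
        # distribution is unchanged.
        weights = np.exp((returns - returns.max()) / max(temperature, 1e-8))
        probs = weights / weights.sum()

        def sample_batch(batch_size):
            idx = rng.choice(len(trajectories), size=batch_size, p=probs)
            return [trajectories[i][0] for i in idx]

        return sample_batch

Training then draws batches from the returned sampler instead of uniformly from the dataset, leaving the rest of the offline RL algorithm unchanged.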