Self Punishment and Reward Backfill for Deep Q-Learning
- URL: http://arxiv.org/abs/2004.05002v2
- Date: Sat, 1 Jan 2022 19:47:24 GMT
- Title: Self Punishment and Reward Backfill for Deep Q-Learning
- Authors: Mohammad Reza Bonyadi, Rui Wang, Maryam Ziaei
- Abstract summary: Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment.
In many environments, the reward is provided after a series of actions rather than each single action, leading the agent to experience ambiguity in terms of whether those actions are effective.
We propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward.
- Score: 6.572828651397661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning agents learn by encouraging behaviours which maximize
their total reward, usually provided by the environment. In many environments,
however, the reward is provided after a series of actions rather than each
single action, leading the agent to experience ambiguity in terms of whether
those actions are effective, an issue known as the credit assignment problem.
In this paper, we propose two strategies inspired by behavioural psychology to
enable the agent to intrinsically estimate more informative reward values for
actions with no reward. The first strategy, called self-punishment (SP),
discourages the agent from making mistakes that lead to undesirable terminal
states. The second strategy, called reward backfill (RB), propagates rewards
backwards over the unrewarded actions between two rewarded actions. We prove
that, under certain
assumptions and regardless of the reinforcement learning algorithm used, these
two strategies maintain the order of policies in the space of all possible
policies in terms of their total reward, and, by extension, maintain the
optimal policy. Hence, our proposed strategies integrate with any reinforcement
learning algorithm that learns a value or action-value function through
experience. We incorporated these two strategies into three popular deep
reinforcement learning approaches and evaluated the results on thirty Atari
games. After parameter tuning, our results indicate that the proposed
strategies improve the tested methods in over 65 percent of the tested games,
with performance gains of up to more than 25 times.
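The abstract describes the two strategies concretely enough to sketch them as a reward-shaping step applied to a completed episode before its transitions are used by any value-based learner. The sketch below is an illustrative assumption of how SP and RB could look: the decaying backfill scheme and the `penalty` and `decay` parameters are not taken from the paper.

```python
# Illustrative sketch of the two reward-shaping strategies; parameter names,
# the decay scheme, and the per-episode usage are assumptions, not the
# authors' exact formulation.

def self_punishment(rewards, terminal_failed, penalty=-1.0):
    """Self-punishment (SP): add a negative reward at the final step of an
    episode that ends in an undesirable terminal state."""
    shaped = list(rewards)
    if terminal_failed and shaped:
        shaped[-1] += penalty
    return shaped


def reward_backfill(rewards, decay=0.9):
    """Reward backfill (RB): propagate each non-zero reward backwards, with
    decay, over the zero-reward actions that precede it, so the actions
    between two rewarded actions receive an informative reward estimate."""
    shaped = list(rewards)
    carry = 0.0
    for t in reversed(range(len(shaped))):
        if shaped[t] != 0.0:
            carry = shaped[t]   # backfill restarts at every rewarded action
        else:
            carry *= decay      # credit decays with distance from the reward
            shaped[t] += carry
    return shaped


# A sparse reward at the end of a 5-step episode is spread backwards:
print(reward_backfill([0.0, 0.0, 0.0, 0.0, 1.0]))
# An episode ending in an undesirable terminal state is penalised:
print(self_punishment([0.0, 0.0, 0.0], terminal_failed=True))
```
Because both functions only rewrite the reward sequence, they could be applied to transitions before they enter a replay buffer, consistent with the abstract's claim that the strategies integrate with any algorithm that learns a value or action-value function through experience.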
Related papers
- Fast Peer Adaptation with Context-aware Exploration [63.08444527039578]
We propose a peer identification reward for learning agents in multi-agent games.
This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation.
We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents.
arXiv Detail & Related papers (2024-02-04T13:02:27Z) - A State Augmentation based approach to Reinforcement Learning from Human
Preferences [20.13307800821161]
Preference Based Reinforcement Learning attempts to solve the issue by utilizing binary feedback on queried trajectory pairs.
We present a state augmentation technique that allows the agent's reward model to be robust.
arXiv Detail & Related papers (2023-02-17T07:10:50Z) - Credit-cognisant reinforcement learning for multi-agent cooperation [0.0]
We introduce the concept of credit-cognisant rewards, which allows an agent to perceive the effect its actions had on the environment as well as on its co-agents.
We show that by manipulating these experiences and constructing the reward contained within them to include the rewards received by all the agents within the same action sequence, we are able to improve significantly on the performance of independent deep Q-learning.
arXiv Detail & Related papers (2022-11-18T09:00:25Z) - Imitating Past Successes can be Very Suboptimal [145.70788608016755]
We show that existing outcome-conditioned imitation learning methods do not necessarily improve the policy.
We show that a simple modification results in a method that does guarantee policy improvement.
Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
arXiv Detail & Related papers (2022-06-07T15:13:43Z) - Execute Order 66: Targeted Data Poisoning for Reinforcement Learning [52.593097204559314]
We introduce an insidious poisoning attack for reinforcement learning which causes agent misbehavior only at specific target states.
We accomplish this by adapting a recent technique, gradient alignment, to reinforcement learning.
We test our method and demonstrate success in two Atari games of varying difficulty.
arXiv Detail & Related papers (2022-01-03T17:09:32Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Disturbing Reinforcement Learning Agents with Corrupted Rewards [62.997667081978825]
We analyze the effects of different attack strategies based on reward perturbations on reinforcement learning algorithms.
We show that smoothly crafted adversarial rewards are able to mislead the learner, and that with low exploration probability values the learned policy is more robust to corrupted rewards.
arXiv Detail & Related papers (2021-02-12T15:53:48Z) - Difference Rewards Policy Gradients [17.644110838053134]
We propose a novel algorithm that combines difference rewards with policy gradients to allow for learning decentralized policies.
By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function.
We show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
arXiv Detail & Related papers (2020-12-21T11:23:17Z) - Joint Goal and Strategy Inference across Heterogeneous Demonstrators via
Reward Network Distillation [1.1470070927586016]
Inverse reinforcement learning (IRL) seeks to learn a reward function from readily-obtained human demonstrations.
We propose a method to jointly infer a task goal and humans' strategic preferences via network distillation.
We demonstrate our algorithm can better recover task reward and strategy rewards and imitate the strategies in two simulated tasks and a real-world table tennis task.
arXiv Detail & Related papers (2020-01-02T16:04:21Z)