Self Punishment and Reward Backfill for Deep Q-Learning
- URL: http://arxiv.org/abs/2004.05002v2
- Date: Sat, 1 Jan 2022 19:47:24 GMT
- Title: Self Punishment and Reward Backfill for Deep Q-Learning
- Authors: Mohammad Reza Bonyadi, Rui Wang, Maryam Ziaei
- Abstract summary: Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment.
In many environments, the reward is provided after a series of actions rather than each single action, leading the agent to experience ambiguity in terms of whether those actions are effective.
We propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward.
- Score: 6.572828651397661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning agents learn by encouraging behaviours which maximize
their total reward, usually provided by the environment. In many environments,
however, the reward is provided after a series of actions rather than each
single action, leading the agent to experience ambiguity in terms of whether
those actions are effective, an issue known as the credit assignment problem.
In this paper, we propose two strategies inspired by behavioural psychology to
enable the agent to intrinsically estimate more informative reward values for
actions with no reward. The first strategy, called self-punishment (SP),
discourages the agent from making mistakes that lead to undesirable terminal
states. The second strategy, called reward backfill (RB), propagates rewards
backwards over the unrewarded actions between two rewarded actions. We prove
that, under certain
assumptions and regardless of the reinforcement learning algorithm used, these
two strategies maintain the order of policies in the space of all possible
policies in terms of their total reward, and, by extension, maintain the
optimal policy. Hence, our proposed strategies integrate with any reinforcement
learning algorithm that learns a value or action-value function through
experience. We incorporated these two strategies into three popular deep
reinforcement learning approaches and evaluated the results on thirty Atari
games. After parameter tuning, our results indicate that the proposed
strategies improve the tested methods in over 65 percent of the tested games,
with performance gains of up to more than 25 times.
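The abstract describes the two strategies concretely enough to sketch them as a reward-shaping step applied to a completed episode before its transitions are used by any value-based learner. The sketch below is an illustrative assumption of how SP and RB could look: the decaying backfill scheme and the `penalty` and `decay` parameters are not taken from the paper.

```python
# Illustrative sketch of the two reward-shaping strategies; parameter names,
# the decay scheme, and the per-episode usage are assumptions, not the
# authors' exact formulation.

def self_punishment(rewards, terminal_failed, penalty=-1.0):
    """Self-punishment (SP): add a negative reward at the final step of an
    episode that ends in an undesirable terminal state."""
    shaped = list(rewards)
    if terminal_failed and shaped:
        shaped[-1] += penalty
    return shaped


def reward_backfill(rewards, decay=0.9):
    """Reward backfill (RB): propagate each non-zero reward backwards, with
    decay, over the zero-reward actions that precede it, so the actions
    between two rewarded actions receive an informative reward estimate."""
    shaped = list(rewards)
    carry = 0.0
    for t in reversed(range(len(shaped))):
        if shaped[t] != 0.0:
            carry = shaped[t]   # backfill restarts at every rewarded action
        else:
            carry *= decay      # credit decays with distance from the reward
            shaped[t] += carry
    return shaped


# A sparse reward at the end of a 5-step episode is spread backwards:
print(reward_backfill([0.0, 0.0, 0.0, 0.0, 1.0]))
# An episode ending in an undesirable terminal state is penalised:
print(self_punishment([0.0, 0.0, 0.0], terminal_failed=True))
```
Because both functions only rewrite the reward sequence, they could be applied to transitions before they enter a replay buffer, consistent with the abstract's claim that the strategies integrate with any algorithm that learns a value or action-value function through experience.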
Related papers
- Fast Peer Adaptation with Context-aware Exploration [63.08444527039578]
We propose a peer identification reward for learning agents in multi-agent games.
This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation.
We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents.
arXiv Detail & Related papers (2024-02-04T13:02:27Z) - A State Augmentation based approach to Reinforcement Learning from Human
Preferences [20.13307800821161]
Preference Based Reinforcement Learning attempts to solve the issue by utilizing binary feedback on queried trajectory pairs.
We present a state augmentation technique that allows the agent's reward model to be robust.
arXiv Detail & Related papers (2023-02-17T07:10:50Z) - Credit-cognisant reinforcement learning for multi-agent cooperation [0.0]
We introduce the concept of credit-cognisant rewards, which allows an agent to perceive the effect its actions had on the environment as well as on its co-agents.
We show that by manipulating these experiences and constructing the reward contained within them to include the rewards received by all the agents within the same action sequence, we are able to improve significantly on the performance of independent deep Q-learning.
arXiv Detail & Related papers (2022-11-18T09:00:25Z) - Imitating Past Successes can be Very Suboptimal [145.70788608016755]
We show that existing outcome-conditioned imitation learning methods do not necessarily improve the policy.
We show that a simple modification results in a method that does guarantee policy improvement.
Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
arXiv Detail & Related papers (2022-06-07T15:13:43Z) - Execute Order 66: Targeted Data Poisoning for Reinforcement Learning [52.593097204559314]
We introduce an insidious poisoning attack for reinforcement learning which causes agent misbehavior only at specific target states.
We accomplish this by adapting a recent technique, gradient alignment, to reinforcement learning.
We test our method and demonstrate success in two Atari games of varying difficulty.
arXiv Detail & Related papers (2022-01-03T17:09:32Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Disturbing Reinforcement Learning Agents with Corrupted Rewards [62.997667081978825]
We analyze the effects of different attack strategies based on reward perturbations on reinforcement learning algorithms.
We show that smoothly crafted adversarial rewards are able to mislead the learner, and that with low exploration probability values the learned policy is more robust to corrupted rewards.
arXiv Detail & Related papers (2021-02-12T15:53:48Z) - Difference Rewards Policy Gradients [17.644110838053134]
We propose a novel algorithm that combines difference rewards with policy gradients to allow for learning decentralized policies.
By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function.
We show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
arXiv Detail & Related papers (2020-12-21T11:23:17Z) - Joint Goal and Strategy Inference across Heterogeneous Demonstrators via
Reward Network Distillation [1.1470070927586016]
Inverse reinforcement learning (IRL) seeks to learn a reward function from readily-obtained human demonstrations.
We propose a method to jointly infer a task goal and humans' strategic preferences via network distillation.
We demonstrate our algorithm can better recover task reward and strategy rewards and imitate the strategies in two simulated tasks and a real-world table tennis task.
arXiv Detail & Related papers (2020-01-02T16:04:21Z)