Reinforcement Learning with Trajectory Feedback
- URL: http://arxiv.org/abs/2008.06036v2
- Date: Thu, 4 Mar 2021 20:04:50 GMT
- Title: Reinforcement Learning with Trajectory Feedback
- Authors: Yonathan Efroni, Nadav Merlis, Shie Mannor
- Abstract summary: In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret.
- Score: 76.94405309609552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard feedback model of reinforcement learning requires revealing the
reward of every visited state-action pair. However, in practice, it is often
the case that such frequent feedback is not available. In this work, we take a
first step towards relaxing this assumption and require a weaker form of
feedback, which we refer to as \emph{trajectory feedback}. Instead of observing
the reward obtained after every action, we assume we only receive a score that
represents the quality of the whole trajectory observed by the agent, namely,
the sum of all rewards obtained over this trajectory. We extend reinforcement
learning algorithms to this setting, based on least-squares estimation of the
unknown reward, for both the known and unknown transition model cases, and
study the performance of these algorithms by analyzing their regret. For cases
where the transition model is unknown, we offer a hybrid optimistic-Thompson
Sampling approach that results in a tractable algorithm.
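
As a rough, illustrative sketch of the least-squares idea in the abstract (not the paper's exact estimator, and omitting the optimistic bonuses and Thompson Sampling components), one could fit per-state-action rewards by ridge regression on episode visitation counts. The tabular encoding, variable names, and regularization constant below are assumptions.

```python
import numpy as np

def estimate_rewards(trajectories, returns, n_states, n_actions, reg=1.0):
    """Regularized least-squares estimate of per-(state, action) rewards
    from trajectory feedback (the observed sum of rewards per episode).

    trajectories: list of episodes, each a list of (state, action) pairs.
    returns: observed trajectory scores (noisy sums of rewards), one per episode.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))        # visitation-count features
    for k, traj in enumerate(trajectories):
        for s, a in traj:
            X[k, s * n_actions + a] += 1.0      # count visits to (s, a) in episode k
    y = np.asarray(returns, dtype=float)
    # Ridge regression: r_hat = (X^T X + reg * I)^{-1} X^T y
    r_hat = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return r_hat.reshape(n_states, n_actions)
```

The regression treats each episode's visitation counts as a feature vector, so the trajectory score is modeled as a noisy linear function of the unknown reward vector.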
Related papers
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use the reward model's attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
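A minimal, hedged sketch of this redistribution idea (not the paper's exact procedure; how the per-token weights are extracted and aggregated from the reward model's attention is assumed here) would normalize the weights over the completion and spread the scalar reward proportionally:

```python
import numpy as np

def redistribute_reward(scalar_reward, attention_weights):
    """Spread a sequence-level scalar reward over tokens in proportion to
    (hypothetical) per-token attention weights taken from the reward model.

    attention_weights: 1-D array of non-negative per-token weights.
    """
    w = np.asarray(attention_weights, dtype=float)
    w = w / w.sum()              # normalize to a distribution over tokens
    return scalar_reward * w     # per-token dense rewards summing to the scalar
```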
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward instead of the exact reward at each timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
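As an illustration only (the specific smoothing function, kernel width, and window below are assumptions, not necessarily DreamSmooth's choices), a Gaussian smoothing of an episode's reward sequence could serve as the temporally-smoothed prediction target:

```python
import numpy as np

def smooth_rewards(rewards, sigma=3.0, radius=10):
    """Gaussian-smooth an episode's reward sequence to produce training
    targets for a learned reward predictor (sigma and radius are
    illustrative assumptions)."""
    rewards = np.asarray(rewards, dtype=float)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # mode="same" keeps the smoothed sequence aligned with the original timesteps
    return np.convolve(rewards, kernel, mode="same")
```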
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- Emergent representations in networks trained with the Forward-Forward algorithm [0.6597195879147556]
We show that networks trained with the Forward-Forward algorithm organise their internal representations into category-specific ensembles exhibiting high sparsity.
Results suggest that the learning procedure proposed by Forward-Forward may be superior to Backpropagation in modelling learning in the cortex.
arXiv Detail & Related papers (2023-05-26T14:39:46Z)
- Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
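A hedged sketch of the imputation idea (the Gaussian sketch matrix, sketch size, and ridge penalty are illustrative assumptions rather than SPUIR's exact construction): fit a ridge regressor on a random sketch of the observed contexts and use it to fill in rewards for non-executed actions.

```python
import numpy as np

def impute_rewards(X_obs, y_obs, X_unobs, sketch_dim=64, reg=1.0, seed=0):
    """Impute unobserved rewards with ridge regression fitted on a random
    Gaussian sketch of the observed data (sketch-and-solve style).

    X_obs: (n, d) contexts with observed rewards y_obs.
    X_unobs: (m, d) contexts of non-executed actions whose rewards we fill in.
    """
    rng = np.random.default_rng(seed)
    n, d = X_obs.shape
    S = rng.normal(size=(sketch_dim, n)) / np.sqrt(sketch_dim)  # sketch matrix
    SX, Sy = S @ X_obs, S @ y_obs                               # compressed data
    theta = np.linalg.solve(SX.T @ SX + reg * np.eye(d), SX.T @ Sy)
    return X_unobs @ theta                                      # imputed rewards
```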
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large number of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
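A minimal sketch of the confidence-based pseudo-labelling step (the threshold and the predictor interface are assumptions, and SURF's temporal data augmentation is omitted):

```python
def pseudo_label_pairs(segment_pairs, predictor, threshold=0.9):
    """Assign pseudo-preference labels to unlabeled segment pairs when the
    preference predictor is confident enough (threshold is an assumed value).

    predictor(seg_a, seg_b) is assumed to return P(seg_a preferred over seg_b).
    """
    labeled = []
    for seg_a, seg_b in segment_pairs:
        p = predictor(seg_a, seg_b)
        if p >= threshold:
            labeled.append((seg_a, seg_b, 1))   # confident: seg_a preferred
        elif p <= 1.0 - threshold:
            labeled.append((seg_a, seg_b, 0))   # confident: seg_b preferred
        # low-confidence pairs are simply skipped
    return labeled
```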
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
This setting involves an extreme delay of reward signals: the agent obtains only a single reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
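As a rough sketch of the randomized-return-decomposition idea (the subset size, scaling, and interface are assumptions based only on the summary above), the proxy reward can be trained by regressing a rescaled random-subset sum of its predictions onto the episode return:

```python
import numpy as np

def rrd_loss(reward_fn, states, actions, episode_return, subset_size=32, rng=None):
    """One Monte-Carlo sample of a randomized-return-decomposition style loss:
    the rescaled sum of predicted rewards over a random subset of timesteps is
    regressed onto the observed episode return."""
    rng = rng or np.random.default_rng()
    T = len(states)
    idx = rng.choice(T, size=min(subset_size, T), replace=False)
    preds = np.array([reward_fn(states[t], actions[t]) for t in idx])
    estimate = (T / len(idx)) * preds.sum()   # unbiased estimate of the full sum
    return (episode_return - estimate) ** 2
```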
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits [36.37578212532926]
We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance.
Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy.
We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms.
arXiv Detail & Related papers (2021-06-28T17:37:49Z)
- On the Theory of Reinforcement Learning with Once-per-Episode Feedback [120.5537226120512]
We introduce a theory of reinforcement learning in which the learner receives feedback only once at the end of an episode.
This is arguably more representative of real-world applications than the traditional requirement that the learner receive feedback at every time step.
arXiv Detail & Related papers (2021-05-29T19:48:51Z)
- A Contraction Approach to Model-based Reinforcement Learning [11.701145942745274]
We analyze the error in the cumulative reward using a contraction approach.
We prove that branched rollouts can reduce this error.
We further show that GAN-type learning has an advantage over Behavioral Cloning when its discriminator is well-trained.
arXiv Detail & Related papers (2020-09-18T02:03:14Z)