Reinforcement Learning with Trajectory Feedback
- URL: http://arxiv.org/abs/2008.06036v2
- Date: Thu, 4 Mar 2021 20:04:50 GMT
- Title: Reinforcement Learning with Trajectory Feedback
- Authors: Yonathan Efroni, Nadav Merlis, Shie Mannor
- Abstract summary: In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret.
- Score: 76.94405309609552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard feedback model of reinforcement learning requires revealing the
reward of every visited state-action pair. However, in practice, it is often
the case that such frequent feedback is not available. In this work, we take a
first step towards relaxing this assumption and require a weaker form of
feedback, which we refer to as \emph{trajectory feedback}. Instead of observing
the reward obtained after every action, we assume we only receive a score that
represents the quality of the whole trajectory observed by the agent, namely,
the sum of all rewards obtained over this trajectory. We extend reinforcement
learning algorithms to this setting, based on least-squares estimation of the
unknown reward, for both the known and unknown transition model cases, and
study the performance of these algorithms by analyzing their regret. For cases
where the transition model is unknown, we offer a hybrid optimistic-Thompson
Sampling approach that results in a tractable algorithm.
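
As a rough, illustrative sketch of the least-squares idea in the abstract (not the paper's exact estimator, and omitting the optimistic bonuses and Thompson Sampling components), one could fit per-state-action rewards by ridge regression on episode visitation counts. The tabular encoding, variable names, and regularization constant below are assumptions.

```python
import numpy as np

def estimate_rewards(trajectories, returns, n_states, n_actions, reg=1.0):
    """Regularized least-squares estimate of per-(state, action) rewards
    from trajectory feedback (the observed sum of rewards per episode).

    trajectories: list of episodes, each a list of (state, action) pairs.
    returns: observed trajectory scores (noisy sums of rewards), one per episode.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))        # visitation-count features
    for k, traj in enumerate(trajectories):
        for s, a in traj:
            X[k, s * n_actions + a] += 1.0      # count visits to (s, a) in episode k
    y = np.asarray(returns, dtype=float)
    # Ridge regression: r_hat = (X^T X + reg * I)^{-1} X^T y
    r_hat = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return r_hat.reshape(n_states, n_actions)
```

The regression treats each episode's visitation counts as a feature vector, so the trajectory score is modeled as a noisy linear function of the unknown reward vector.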
Related papers
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use the reward model's attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
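A minimal, hedged sketch of this redistribution idea (not the paper's exact procedure; how the per-token weights are extracted and aggregated from the reward model's attention is assumed here) would normalize the weights over the completion and spread the scalar reward proportionally:

```python
import numpy as np

def redistribute_reward(scalar_reward, attention_weights):
    """Spread a sequence-level scalar reward over tokens in proportion to
    (hypothetical) per-token attention weights taken from the reward model.

    attention_weights: 1-D array of non-negative per-token weights.
    """
    w = np.asarray(attention_weights, dtype=float)
    w = w / w.sum()              # normalize to a distribution over tokens
    return scalar_reward * w     # per-token dense rewards summing to the scalar
```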
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward instead of the exact reward at each timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
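As an illustration only (the specific smoothing function, kernel width, and window below are assumptions, not necessarily DreamSmooth's choices), a Gaussian smoothing of an episode's reward sequence could serve as the temporally-smoothed prediction target:

```python
import numpy as np

def smooth_rewards(rewards, sigma=3.0, radius=10):
    """Gaussian-smooth an episode's reward sequence to produce training
    targets for a learned reward predictor (sigma and radius are
    illustrative assumptions)."""
    rewards = np.asarray(rewards, dtype=float)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # mode="same" keeps the smoothed sequence aligned with the original timesteps
    return np.convolve(rewards, kernel, mode="same")
```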
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- Emergent representations in networks trained with the Forward-Forward algorithm [0.6597195879147556]
We show that networks trained with the Forward-Forward algorithm organise their internal representations into category-specific ensembles exhibiting high sparsity.
Results suggest that the learning procedure proposed by Forward-Forward may be superior to Backpropagation in modelling learning in the cortex.
arXiv Detail & Related papers (2023-05-26T14:39:46Z)
- Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
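A hedged sketch of the imputation idea (the Gaussian sketch matrix, sketch size, and ridge penalty are illustrative assumptions rather than SPUIR's exact construction): fit a ridge regressor on a random sketch of the observed contexts and use it to fill in rewards for non-executed actions.

```python
import numpy as np

def impute_rewards(X_obs, y_obs, X_unobs, sketch_dim=64, reg=1.0, seed=0):
    """Impute unobserved rewards with ridge regression fitted on a random
    Gaussian sketch of the observed data (sketch-and-solve style).

    X_obs: (n, d) contexts with observed rewards y_obs.
    X_unobs: (m, d) contexts of non-executed actions whose rewards we fill in.
    """
    rng = np.random.default_rng(seed)
    n, d = X_obs.shape
    S = rng.normal(size=(sketch_dim, n)) / np.sqrt(sketch_dim)  # sketch matrix
    SX, Sy = S @ X_obs, S @ y_obs                               # compressed data
    theta = np.linalg.solve(SX.T @ SX + reg * np.eye(d), SX.T @ Sy)
    return X_unobs @ theta                                      # imputed rewards
```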
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large number of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
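A minimal sketch of the confidence-based pseudo-labelling step (the threshold and the predictor interface are assumptions, and SURF's temporal data augmentation is omitted):

```python
def pseudo_label_pairs(segment_pairs, predictor, threshold=0.9):
    """Assign pseudo-preference labels to unlabeled segment pairs when the
    preference predictor is confident enough (threshold is an assumed value).

    predictor(seg_a, seg_b) is assumed to return P(seg_a preferred over seg_b).
    """
    labeled = []
    for seg_a, seg_b in segment_pairs:
        p = predictor(seg_a, seg_b)
        if p >= threshold:
            labeled.append((seg_a, seg_b, 1))   # confident: seg_a preferred
        elif p <= 1.0 - threshold:
            labeled.append((seg_a, seg_b, 0))   # confident: seg_b preferred
        # low-confidence pairs are simply skipped
    return labeled
```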
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
This setting involves an extreme delay of reward signals: the agent obtains only a single reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
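As a rough sketch of the randomized-return-decomposition idea (the subset size, scaling, and interface are assumptions based only on the summary above), the proxy reward can be trained by regressing a rescaled random-subset sum of its predictions onto the episode return:

```python
import numpy as np

def rrd_loss(reward_fn, states, actions, episode_return, subset_size=32, rng=None):
    """One Monte-Carlo sample of a randomized-return-decomposition style loss:
    the rescaled sum of predicted rewards over a random subset of timesteps is
    regressed onto the observed episode return."""
    rng = rng or np.random.default_rng()
    T = len(states)
    idx = rng.choice(T, size=min(subset_size, T), replace=False)
    preds = np.array([reward_fn(states[t], actions[t]) for t in idx])
    estimate = (T / len(idx)) * preds.sum()   # unbiased estimate of the full sum
    return (episode_return - estimate) ** 2
```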
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits [36.37578212532926]
We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance.
Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy.
We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms.
arXiv Detail & Related papers (2021-06-28T17:37:49Z)
- On the Theory of Reinforcement Learning with Once-per-Episode Feedback [120.5537226120512]
We introduce a theory of reinforcement learning in which the learner receives feedback only once at the end of an episode.
This is arguably more representative of real-world applications than the traditional requirement that the learner receive feedback at every time step.
arXiv Detail & Related papers (2021-05-29T19:48:51Z)
- A Contraction Approach to Model-based Reinforcement Learning [11.701145942745274]
We analyze the error in the cumulative reward using a contraction approach.
We prove that branched rollouts can reduce this error.
We further show that GAN-type learning has an advantage over Behavioral Cloning when its discriminator is well-trained.
arXiv Detail & Related papers (2020-09-18T02:03:14Z)