A State Representation for Diminishing Rewards
- URL: http://arxiv.org/abs/2309.03710v1
- Date: Thu, 7 Sep 2023 13:38:36 GMT
- Title: A State Representation for Diminishing Rewards
- Authors: Ted Moskovitz, Samo Hromadka, Ahmed Touati, Diana Borsa, Maneesh
Sahani
- Abstract summary: A common setting in multitask reinforcement learning (RL) demands that an agent rapidly adapt to various stationary reward functions randomly sampled from a fixed distribution.
In the natural world, sequential tasks are rarely independent, and instead reflect shifting priorities based on the availability and subjective perception of rewarding stimuli.
We introduce the $\lambda$ representation ($\lambda$R) which, surprisingly, is required for policy evaluation in this setting.
- Score: 20.945260614372327
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A common setting in multitask reinforcement learning (RL) demands that an
agent rapidly adapt to various stationary reward functions randomly sampled
from a fixed distribution. In such situations, the successor representation
(SR) is a popular framework which supports rapid policy evaluation by
decoupling a policy's expected discounted, cumulative state occupancies from a
specific reward function. However, in the natural world, sequential tasks are
rarely independent, and instead reflect shifting priorities based on the
availability and subjective perception of rewarding stimuli. Reflecting this
disjunction, in this paper we study the phenomenon of diminishing marginal
utility and introduce a novel state representation, the $\lambda$
representation ($\lambda$R) which, surprisingly, is required for policy
evaluation in this setting and which generalizes the SR as well as several
other state representations from the literature. We establish the $\lambda$R's
formal properties and examine its normative advantages in the context of
machine learning, as well as its usefulness for studying natural behaviors,
particularly foraging.
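To make the abstract's objects concrete: the SR is the expected discounted state occupancy, which linearly decouples policy evaluation from the reward. As a hedged reading (our rendering of the abstract, not an equation quoted from the paper), the $\lambda$R plausibly discounts each visit to a state by $\lambda$ raised to the number of prior visits, capturing diminishing marginal utility:
$$M^\pi(s, s') = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \mathbb{1}(s_t = s') \,\middle|\, s_0 = s\right], \qquad \Phi^\pi_\lambda(s, s') = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \lambda^{n_t(s')} \mathbb{1}(s_t = s') \,\middle|\, s_0 = s\right],$$
where $n_t(s')$ counts visits to $s'$ before time $t$. Under a reward that shrinks by a factor $\lambda$ per visit, policy evaluation would take the form $V^\pi(s) = \sum_{s'} \Phi^\pi_\lambda(s, s')\, \bar{r}(s')$; setting $\lambda = 1$ recovers the SR, and $\lambda = 0$ yields a first-occupancy-style representation, consistent with the claim that the $\lambda$R generalizes several representations from the literature.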
Related papers
- Reward-Conditioned Reinforcement Learning [56.417273471201845]
We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
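The summary leaves the mechanism implicit; one plausible reading is a value function that takes the reward parameters $w$ as an input and relabels shared replay transitions under each objective. A minimal tabular sketch under that assumption (every name here is hypothetical, not from the paper):

```python
import random

# A minimal, hypothetical sketch of reward conditioning: a tabular Q-function
# indexed by (state, action, w), where w parameterizes the reward, trained
# off-policy by relabeling transitions from a single shared replay buffer.

GAMMA, ALPHA, ACTIONS = 0.99, 0.1, (0, 1)
Q = {}  # (state, action, w) -> value estimate

def reward_w(w, state, action):
    # Assumed linear reward family: w weights two toy state-action features.
    feats = (float(state == 0), float(action == 1))
    return w[0] * feats[0] + w[1] * feats[1]

def q_update(replay, w):
    # One off-policy Q-learning sweep against the objective indexed by w; the
    # same replay transitions serve every w ("shared replay data").
    for s, a, s_next in replay:
        target = reward_w(w, s, a) + GAMMA * max(
            Q.get((s_next, b, w), 0.0) for b in ACTIONS
        )
        old = Q.get((s, a, w), 0.0)
        Q[(s, a, w)] = old + ALPHA * (target - old)

# The same (state, action, next_state) data trains two distinct objectives.
replay = [(random.randint(0, 1), random.choice(ACTIONS), random.randint(0, 1))
          for _ in range(200)]
for w in [(1.0, 0.0), (0.0, 1.0)]:
    q_update(replay, w)
```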
arXiv Detail & Related papers (2026-03-05T11:29:17Z) - Hierarchical Successor Representation for Robust Transfer [10.635248457021495]
We propose the Hierarchical Successor Representation (HSR). By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. We show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
arXiv Detail & Related papers (2026-02-13T09:32:26Z) - Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids [37.79354987519793]
We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. We propose a novel, principled criterion that selects the "average" policy among those induced by the rewards in a certain bounded subset of the feasible set.
arXiv Detail & Related papers (2025-09-15T14:53:54Z) - RewardAnything: Generalizable Principle-Following Reward Models [82.16312590749052]
Reward models are typically trained on fixed preference datasets. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. We introduce generalizable, principle-following reward models. We present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles.
arXiv Detail & Related papers (2025-06-04T07:30:16Z) - Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs [18.48866194756127]
Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory. We provide a full characterization of such reward functions, as well as the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. We show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP setting, another common multiobjective setting, they do not.
arXiv Detail & Related papers (2025-05-17T15:23:58Z) - Likelihood Reward Redistribution [0.0]
We propose a Likelihood Reward Redistribution (LRR) framework for reward redistribution.
When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals.
arXiv Detail & Related papers (2025-03-20T20:50:49Z) - Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery [1.1394969272703013]
Adversarial inverse reinforcement learning (AIRL) serves as a foundational approach to providing comprehensive and transferable task descriptions.
This paper reexamines AIRL in settings with an unobservable transition matrix or limited informative priors.
We show that AIRL can disentangle rewards for effective transfer with high probability, irrespective of specific conditions.
arXiv Detail & Related papers (2024-10-10T06:21:32Z) - Interpretable Reward Redistribution in Reinforcement Learning: A Causal
Approach [45.83200636718999]
A major challenge in reinforcement learning is to determine which state-action pairs are responsible for delayed future rewards.
We propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution.
Experimental results show that our method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-28T21:51:38Z) - Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z) - Learning Symbolic Representations for Reinforcement Learning of
Non-Markovian Behavior [23.20013012953065]
We show how to automatically discover useful state abstractions that support learning automata over the state-action history.
The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL.
arXiv Detail & Related papers (2023-01-08T00:47:19Z) - Rewards Encoding Environment Dynamics Improves Preference-based
Reinforcement Learning [4.969254618158096]
We show that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks.
For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
arXiv Detail & Related papers (2022-11-12T00:34:41Z) - Benefits of Permutation-Equivariance in Auction Mechanisms [90.42990121652956]
Designing an auction mechanism that maximizes the auctioneer's revenue while minimizing bidders' ex-post regret is an important yet intricate problem in economics.
Remarkable progress has been achieved by learning optimal auction mechanisms with neural networks.
arXiv Detail & Related papers (2022-10-11T16:13:25Z) - Temporally Extended Successor Representations [0.9176056742068812]
We present a temporally extended variation of the successor representation, which we term t-SR.
t-SR captures the expected state transition dynamics of temporally extended actions by constructing successor representations over primitive action repeats.
We show that in environments with dynamic reward structure, t-SR is able to leverage both the flexibility of the successor representation and the abstraction afforded by temporally extended actions.
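As a rough illustration of the stated construction (a sketch under our own assumptions, not code from the paper), a tabular SR can be learned over macro-steps in which each primitive action is repeated $k$ times:

```python
import numpy as np

# Hypothetical sketch of an SR built over primitive action repeats: each
# "macro-step" repeats one action k times, and a standard SR TD update is
# applied at that coarser timescale. All names here are ours.

N_STATES, GAMMA, ALPHA, K = 5, 0.95, 0.1, 3
M = np.zeros((N_STATES, N_STATES))  # successor matrix over macro-steps
I = np.eye(N_STATES)

def step(s, a):
    # Toy deterministic chain: a=1 moves right, a=0 moves left, clipped.
    return int(np.clip(s + (1 if a == 1 else -1), 0, N_STATES - 1))

def macro_step(s, a, k=K):
    # Repeating a primitive action k times yields one temporally extended step.
    for _ in range(k):
        s = step(s, a)
    return s

s = 0
for _ in range(2000):
    a = int(np.random.choice([0, 1]))
    s_next = macro_step(s, a)
    # SR TD update at the macro timescale: M[s] tracks discounted occupancies.
    M[s] += ALPHA * (I[s] + GAMMA * M[s_next] - M[s])
    s = s_next
```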
arXiv Detail & Related papers (2022-09-25T22:08:08Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly, in a dimension-free fashion, over an entire range of learning rates to the global solution.
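The update itself isn't given in the summary; a generic regularized policy mirror descent step of the kind described, in our notation (with regularizer $h$, regularization weight $\tau$, learning rate $\eta$, and Bregman divergence $D_h$), is
$$\pi_{t+1}(\cdot \mid s) = \arg\max_{p \in \Delta(\mathcal{A})} \left\{ \big\langle Q^{\pi_t}(s, \cdot),\, p \big\rangle - \tau\, h(p) - \tfrac{1}{\eta}\, D_h\big(p,\ \pi_t(\cdot \mid s)\big) \right\},$$
which, for the negative-entropy choice of $h$, reduces to a softmax-style multiplicative-weights update.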
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - DisCo RL: Distribution-Conditioned Reinforcement Learning for
General-Purpose Policies [116.12670064963625]
We develop an off-policy algorithm called distribution-conditioned reinforcement learning (DisCo RL) to efficiently learn contextual policies.
We evaluate DisCo RL on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.
arXiv Detail & Related papers (2021-04-23T16:51:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.