Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids
- URL: http://arxiv.org/abs/2509.12010v1
- Date: Mon, 15 Sep 2025 14:53:54 GMT
- Title: Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids
- Authors: Filippo Lazzati, Alberto Maria Metelli
- Abstract summary: We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. We propose a novel, principled criterion that selects the "average" policy among those induced by the rewards in a certain bounded subset of the feasible set.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. Inverse Reinforcement Learning (IRL) offers a promising solution by seeking to recover the expert's underlying reward function, which, if used for planning in the new settings, would reproduce the desired behavior. However, IRL is inherently ill-posed: multiple reward functions, forming the so-called feasible set, can explain the same observed behavior. Since these rewards may induce different policies in the new setting, in the absence of additional information, a decision criterion is needed to select which policy to deploy. In this paper, we propose a novel, principled criterion that selects the "average" policy among those induced by the rewards in a certain bounded subset of the feasible set. Remarkably, we show that this policy can be obtained by planning with the reward centroid of that subset, for which we derive a closed-form expression. We then present a provably efficient algorithm for estimating this centroid using an offline dataset of expert demonstrations only. Finally, we conduct numerical simulations that illustrate the relationship between the expert's behavior and the behavior produced by our method.
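The ill-posedness and the centroid idea can be illustrated with a small tabular sketch. Everything below is hypothetical (the MDP, the sampling scheme, and all names): the paper derives the feasible subset and its centroid in closed form, whereas this sketch approximates them by rejection sampling.

```python
import numpy as np

# Toy 2-state, 2-action MDP with deterministic transitions; the setup and all
# names here are illustrative, not the authors' construction.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))
P[:, 0, 0] = 1.0   # action 0 always moves to state 0
P[:, 1, 1] = 1.0   # action 1 always moves to state 1
expert_action = np.array([1, 1])  # demonstrated behavior: pick action 1 everywhere

def greedy_policy(R, iters=200):
    """Value iteration under reward R[s, a]; return the greedy action per state."""
    Q = np.zeros((n_s, n_a))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1)

# Monte Carlo stand-in for a bounded subset of the feasible set: sample rewards
# in [-1, 1]^(S x A) and keep those under which the expert's actions are optimal.
rng = np.random.default_rng(0)
feasible = [R for R in rng.uniform(-1.0, 1.0, size=(1000, n_s, n_a))
            if np.array_equal(greedy_policy(R), expert_action)]

centroid = np.mean(feasible, axis=0)  # empirical "reward centroid"
# Planning with the centroid reproduces the expert's behavior in this environment.
print(greedy_policy(centroid), expert_action)
```

Here the check happens in the same environment, where every feasible reward agrees; the paper's point is that the centroid gives a principled selection when planning in a new environment or under added constraints, where different feasible rewards would induce different policies.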
Related papers
- Learning Policy Representations for Steerable Behavior Synthesis [80.4542176039074]
Given a Markov decision process (MDP), we seek to learn representations for a range of policies to facilitate behavior steering at test time. We show that these representations can be approximated uniformly for a range of policies using a set-based architecture. We use a variational generative approach to induce a smooth latent space, and further shape it with contrastive learning so that latent distances align with differences in value functions.
arXiv Detail & Related papers (2026-01-29T21:52:06Z)
- Statistical analysis of Inverse Entropy-regularized Reinforcement Learning [15.054399128586232]
Inverse reinforcement learning aims to infer the reward function that explains expert behavior observed through trajectories of state-action pairs. Many reward functions can induce the same optimal policy, rendering the inverse problem ill-posed. We develop a statistical framework for Inverse Entropy-regularized Reinforcement Learning.
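The ill-posedness mentioned here has a classic concrete form: in entropy-regularized RL, potential-based reward shaping leaves the soft-optimal policy unchanged, so two distinct rewards explain the same expert behavior. A minimal sketch, with a hypothetical randomly generated MDP and illustrative names (`soft_optimal_policy`, `tau`, `Phi`):

```python
import numpy as np

# Illustrative 3-state, 2-action MDP; setup and names are hypothetical.
rng = np.random.default_rng(1)
n_s, n_a, gamma, tau = 3, 2, 0.9, 1.0
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = next-state distribution
R = rng.normal(size=(n_s, n_a))

def soft_optimal_policy(R):
    """Soft value iteration; returns the entropy-regularized optimal policy."""
    V = np.zeros(n_s)
    for _ in range(500):
        Q = R + gamma * P @ V
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))  # soft-max over actions
    pi = np.exp((Q - V[:, None]) / tau)
    return pi / pi.sum(axis=1, keepdims=True)

# Potential-based shaping: R'(s, a) = R(s, a) + gamma * E[Phi(s')] - Phi(s)
Phi = rng.normal(size=n_s)
R_shaped = R + gamma * P @ Phi - Phi[:, None]

# Both rewards induce the same soft-optimal policy -> the inverse problem is ill-posed.
assert np.allclose(soft_optimal_policy(R), soft_optimal_policy(R_shaped), atol=1e-6)
```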
arXiv Detail & Related papers (2025-12-07T18:26:19Z)
- Recursive Reward Aggregation [60.51668865089082]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards. Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
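The "generation and aggregation" view can be sketched by making the aggregator a parameter of policy evaluation: the recursion V(s) = agg(r(s), γ·V(s')) recovers the discounted return when agg is addition, and a different objective when agg is swapped, without touching the reward. A hypothetical deterministic chain under a fixed policy (not the paper's algebraic formalism):

```python
import numpy as np

# Hypothetical 4-state deterministic chain under a fixed policy; state 3 absorbs.
rewards = np.array([1.0, 5.0, 2.0, 0.0])   # r(s) along the policy's trajectory
next_state = np.array([1, 2, 3, 3])        # deterministic successor of each state
gamma = 0.9

def evaluate(agg, iters=200):
    """Policy evaluation with a pluggable aggregator: V(s) = agg(r(s), gamma * V(s'))."""
    V = np.zeros(4)
    for _ in range(iters):
        V = agg(rewards, gamma * V[next_state])
    return V

V_sum = evaluate(np.add)        # classic discounted return: r + gamma * V'
V_max = evaluate(np.maximum)    # "best discounted reward along the trajectory"
print(V_sum.round(2))           # [7.12 6.8  2.   0.  ]
print(V_max.round(2))           # [4.5  5.   2.   0.  ]
```

Swapping `np.add` for `np.maximum` changes which behavior is optimal (state 0 is valued for the 5-reward it leads to, not the accumulated sum) while the reward array stays fixed.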
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Learning Causally Invariant Reward Functions from Diverse Demonstrations [6.351909403078771]
Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations.
When a policy is trained on the recovered reward function, it often overfits to the expert dataset and fails to generalize under distribution shift of the environment dynamics.
In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization.
arXiv Detail & Related papers (2024-09-12T12:56:24Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Distributional Method for Risk Averse Reinforcement Learning [0.0]
We introduce a distributional method for learning the optimal policy in risk averse Markov decision process.
We assume sequential observations of states, actions, and costs and assess the performance of a policy using dynamic risk measures.
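A concrete instance of a dynamic risk measure applies CVaR recursively, one step at a time: V(s) = c(s) + γ·CVaR_α(V(s')). A minimal discrete-distribution sketch (all values and names hypothetical; this is not the paper's distributional algorithm):

```python
import numpy as np

def cvar(costs, probs, alpha):
    """CVaR_alpha of a discrete cost distribution: mean cost in the worst alpha-tail."""
    order = np.argsort(costs)[::-1]                       # worst (largest) costs first
    c, p = np.asarray(costs, float)[order], np.asarray(probs, float)[order]
    cum_before = np.concatenate(([0.0], np.cumsum(p)[:-1]))
    take = np.minimum(p, np.maximum(alpha - cum_before, 0.0))  # mass drawn from the tail
    return float((c * take).sum() / alpha)

# A risk-neutral mean would give (10 + 0) / 2 = 5; CVaR at alpha = 0.25
# looks only at the worst 25% of outcomes, all of which cost 10.
assert cvar([10.0, 0.0], [0.5, 0.5], alpha=0.25) == 10.0

# One step of the dynamic recursion: V(s) = c(s) + gamma * CVaR_alpha(V(s')).
leaf_values = np.array([10.0, 0.0])   # hypothetical next-state values
v_root = 1.0 + 0.9 * cvar(leaf_values, [0.5, 0.5], alpha=0.25)
print(v_root)  # 1 + 0.9 * 10 = 10.0
```

Nesting the one-step risk map in this way is what makes the measure "dynamic": risk aversion is applied at every stage rather than once to the total cost.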
arXiv Detail & Related papers (2023-02-27T19:48:42Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
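The transfer idea in the entry above can be sketched with generalized policy improvement (GPI): given the Q-functions of several base policies evaluated on a new task, acting greedily with respect to their pointwise maximum performs at least as well as every base policy. The Q-values below are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Hypothetical Q[s, a] tables of two base policies, evaluated on a new task.
Q1 = np.array([[1.0, 0.0],
               [0.0, 2.0]])
Q2 = np.array([[0.0, 3.0],
               [1.0, 0.0]])

def gpi_action(s, Qs):
    """GPI: act greedily w.r.t. the pointwise maximum of the base Q-functions."""
    return int(np.argmax(np.max(Qs, axis=0)[s]))

# In state 0 the GPI policy follows policy 2 (Q2 promises 3); in state 1 it
# follows policy 1 (Q1 promises 2) -- it mixes the base policies per state.
print([gpi_action(s, [Q1, Q2]) for s in range(2)])  # [1, 1]
```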
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.