Identifiability in inverse reinforcement learning
- URL: http://arxiv.org/abs/2106.03498v1
- Date: Mon, 7 Jun 2021 10:35:52 GMT
- Title: Identifiability in inverse reinforcement learning
- Authors: Haoyang Cao, Samuel N. Cohen and Lukasz Szpruch
- Abstract summary: Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem.
We provide a resolution to this non-identifiability for problems with entropy regularization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning attempts to reconstruct the reward function in
a Markov decision problem, using observations of agent actions. As already
observed by Russell the problem is ill-posed, and the reward function is not
identifiable, even under the presence of perfect information about optimal
behavior. We provide a resolution to this non-identifiability for problems with
entropy regularization. For a given environment, we fully characterize the
reward functions leading to a given policy and demonstrate that, given
demonstrations of actions for the same reward under two distinct discount
factors, or under sufficiently different environments, the unobserved reward
can be recovered up to a constant. Through a simple numerical experiment, we
demonstrate the accurate reconstruction of the reward function through our
proposed resolution.
Related papers
- Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation [8.897510324428138]
We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer aims to recover the underlying problem parameters.<n>During the learning process, the learner's behavior naturally transitions from exploration to exploitation.<n>We propose a simple and effective framework called Two-Phase Suffix to address this issue.
arXiv Detail & Related papers (2026-03-04T06:36:13Z) - Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization [52.74762030521324]
We propose a novel algorithm to learn reward functions from observed actions.<n>We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm.
arXiv Detail & Related papers (2026-01-19T04:12:51Z) - Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations.<n>The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z) - Recursive Reward Aggregation [51.552609126905885]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function.<n>By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards.<n>Our approach applies to both deterministic and deterministic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Robustness in the Face of Partial Identifiability in Reward Learning [37.79354987519793]
We introduce a general Reward Learning (ReL) framework that permits to quantify the drop in "performance" suffered in considered applications.<n>We then develop Rob-ReL, a ReL algorithm that applies this robust approach to the subset of ReL problems aimed at assessing a preference between two policies.
arXiv Detail & Related papers (2025-01-10T22:55:37Z) - Learning Causally Invariant Reward Functions from Diverse Demonstrations [6.351909403078771]
Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations.
This adaptation often exhibits overfitting to the expert data set when a policy is trained on the obtained reward function under distribution shift of the environment dynamics.
In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization.
arXiv Detail & Related papers (2024-09-12T12:56:24Z) - Transductive Reward Inference on Graph [53.003245457089406]
We develop a reward inference method based on the contextual properties of information propagation on graphs.
We leverage both the available data and limited reward annotations to construct a reward propagation graph.
We employ the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data.
arXiv Detail & Related papers (2024-02-06T03:31:28Z) - Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z) - Identifiability and generalizability from multiple experts in Inverse
Reinforcement Learning [39.632717308147825]
Reinforcement Learning (RL) aims to train an agent from a reward function in a given environment.
Inverse Reinforcement Learning (IRL) seeks to recover the reward function from observing an expert's behavior.
arXiv Detail & Related papers (2022-09-22T12:50:00Z) - The Effects of Reward Misspecification: Mapping and Mitigating
Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z) - Learning Long-Term Reward Redistribution via Randomized Return
Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
arXiv Detail & Related papers (2021-11-26T13:23:36Z) - On the Expressivity of Markov Reward [89.96685777114456]
This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform.
We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories.
arXiv Detail & Related papers (2021-11-01T12:12:16Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Outcome-Driven Reinforcement Learning via Variational Inference [95.82770132618862]
We discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards.
To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function.
We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.
arXiv Detail & Related papers (2021-04-20T18:16:21Z) - Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee which any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.