Related papers: Gradients can train reward models: An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

Gradients can train reward models: An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

URL: http://arxiv.org/abs/2502.14131v1
Date: Wed, 19 Feb 2025 22:22:20 GMT
Title: Gradients can train reward models: An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model
Authors: Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain,
Abstract summary: We study the problem of estimating Dynamic Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning ( offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q*$ functions that govern agent behavior from offline behavior data. We propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards.
Score: 9.531082746970286
License:
Abstract: We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.

Related papers

Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model. We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
A Unified Theory of Stochastic Proximal Point Methods without Smoothness [52.30944052987393]
Proximal point methods have attracted considerable interest owing to their numerical stability and robustness against imperfect tuning. This paper presents a comprehensive analysis of a broad range of variations of the proximal point method (SPPM)
arXiv Detail & Related papers (2024-05-24T21:09:19Z)
When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
Robust Imitation via Mirror Descent Inverse Reinforcement Learning [18.941048578572577]
This paper proposes to predict a sequence of reward functions, which are iterative solutions for a constrained convex problem. We prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence. Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.
arXiv Detail & Related papers (2022-10-20T12:25:21Z)
Safe Continuous Control with Constrained Model-Based Policy Optimization [0.0]
We introduce a model-based safe exploration algorithm for constrained high-dimensional control. We also introduce a practical algorithm that accelerates policy search with model-generated data.
arXiv Detail & Related papers (2021-04-14T15:20:55Z)
Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic. We show that under certain assumptions, the learner finds a near-optimal policy in $O(poly(d))$ exploration rounds. We empirically evaluate this adaptation and show that it outperforms priors inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)
GELATO: Geometrically Enriched Latent Model for Offline Reinforcement Learning [54.291331971813364]
offline reinforcement learning approaches can be divided into proximal and uncertainty-aware methods. In this work, we demonstrate the benefit of combining the two in a latent variational model. Our proposed metrics measure both the quality of out of distribution samples as well as the discrepancy of examples in the data.
arXiv Detail & Related papers (2021-02-22T19:42:40Z)
COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions. We find that COMBO consistently performs as well or better as compared to prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a Finite Horizon [3.867363075280544]
We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem. We produce a global linear convergence guarantee for the setting of finite time horizon and state dynamics under weak assumptions. We show results for the case where we assume a model for the underlying dynamics and where we apply the method to the data directly.
arXiv Detail & Related papers (2020-11-20T09:51:49Z)
SODEN: A Scalable Continuous-Time Survival Model through Ordinary Differential Equation Networks [14.564168076456822]
We propose a flexible model for survival analysis using neural networks along with scalable optimization algorithms. We demonstrate the effectiveness of the proposed method in comparison to existing state-of-the-art deep learning survival analysis models.
arXiv Detail & Related papers (2020-08-19T19:11:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.