Efficient Inference for Inverse Reinforcement Learning and Dynamic Discrete Choice Models
- URL: http://arxiv.org/abs/2512.24407v1
- Date: Tue, 30 Dec 2025 18:41:05 GMT
- Title: Efficient Inference for Inverse Reinforcement Learning and Dynamic Discrete Choice Models
- Authors: Lars van der Laan, Aurelien Bibaut, Nathan Kallus
- Abstract summary: Inverse reinforcement learning (IRL) and dynamic discrete choice (DDC) models explain sequential decision-making by recovering reward functions that rationalize observed behavior. We develop a semiparametric framework for debiased inverse reinforcement learning that yields statistically efficient inference for a broad class of reward-dependent functionals.
- Score: 35.877107409163784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning (IRL) and dynamic discrete choice (DDC) models explain sequential decision-making by recovering reward functions that rationalize observed behavior. Flexible IRL methods typically rely on machine learning but provide no guarantees for valid inference, while classical DDC approaches impose restrictive parametric specifications and often require repeated dynamic programming. We develop a semiparametric framework for debiased inverse reinforcement learning that yields statistically efficient inference for a broad class of reward-dependent functionals in maximum entropy IRL and Gumbel-shock DDC models. We show that the log-behavior policy acts as a pseudo-reward that point-identifies policy value differences and, under a simple normalization, the reward itself. We then formalize these targets, including policy values under known and counterfactual softmax policies and functionals of the normalized reward, as smooth functionals of the behavior policy and transition kernel, establish pathwise differentiability, and derive their efficient influence functions. Building on this characterization, we construct automatic debiased machine-learning estimators that allow flexible nonparametric estimation of nuisance components while achieving $\sqrt{n}$-consistency, asymptotic normality, and semiparametric efficiency. Our framework extends classical inference for DDC models to nonparametric rewards and modern machine-learning tools, providing a unified and computationally tractable approach to statistical inference in IRL.
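The abstract's central identity, that the log-behavior policy acts as a pseudo-reward which point-identifies policy value differences, can be checked numerically in a small tabular MDP. The sketch below is an illustrative toy (random MDP, soft value iteration), not the paper's estimator; it relies only on the standard fact that $\log \pi_b = Q - V$ in a MaxEnt/softmax model, which is a potential-shaped version of the true reward and hence preserves value differences between policies.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(-1, keepdims=True)       # transition kernel P(s' | s, a)
r = rng.random((nS, nA))            # hypothetical ground-truth reward

# Soft (maximum-entropy) value iteration: V(s) = log sum_a exp Q(s, a).
V = np.zeros(nS)
for _ in range(2000):
    Q = r + gamma * (P @ V)         # soft Bellman backup, shape (nS, nA)
    V = np.log(np.exp(Q).sum(-1))

pi_b = np.exp(Q - V[:, None])       # behavior policy = softmax of Q
pseudo_r = np.log(pi_b)             # log-policy pseudo-reward = Q - V

def policy_value(pi, reward):
    """Exact discounted value of a stationary policy via a linear solve."""
    P_pi = np.einsum('sa,sax->sx', pi, P)   # state-to-state kernel under pi
    r_pi = (pi * reward).sum(-1)            # expected per-state reward
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Two arbitrary evaluation policies.
pi1 = rng.random((nS, nA)); pi1 /= pi1.sum(-1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(-1, keepdims=True)

# Value differences under the true reward and under the pseudo-reward
# coincide: log pi_b shapes r by the potential V, which cancels in differences.
diff_true = policy_value(pi1, r) - policy_value(pi2, r)
diff_pseudo = policy_value(pi1, pseudo_r) - policy_value(pi2, pseudo_r)
print(np.max(np.abs(diff_true - diff_pseudo)))  # ~0 up to float precision
```

Because $\log \pi_b(a \mid s) = r(s,a) + \gamma \mathbb{E}[V(s')] - V(s)$, the pseudo-reward is the true reward shaped by the potential $V$, so every policy's value shifts by the same $V(s_0)$ and differences are invariant.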
Related papers
- Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics [6.208369829942616]
We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm. ULD unifies the efficiency of model-free methods with the representational strengths of model-based approaches. It is evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari.
arXiv Detail & Related papers (2026-02-13T06:06:56Z) - Composable Model-Free RL for Navigation with Input-Affine Systems [3.2917282915992883]
As autonomous robots move into complex, dynamic real-world environments, they must learn to navigate safely in real time. We propose a composable, model-free reinforcement learning method that learns a value function and an optimal policy for each individual environment element.
arXiv Detail & Related papers (2026-02-13T00:19:35Z) - ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning [32.8666744273094]
We introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization.
arXiv Detail & Related papers (2026-02-10T17:40:39Z) - Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Inverse Reinforcement Learning Using Just Classification and a Few Regressions [38.71913609455455]
Inverse reinforcement learning aims to explain observed behavior by uncovering an underlying reward. We show that the population maximum-likelihood solution is characterized by a linear fixed-point equation involving the behavior policy. We provide a precise characterization of the optimal solution, a generic oracle-based algorithm, finite-sample error bounds, and empirical results showing competitive or superior performance to MaxEnt IRL.
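The "just classification" idea can be illustrated in a tabular setting, where the maximum-likelihood classifier for the behavior policy reduces to per-state empirical action frequencies. The sketch below is a hypothetical toy (random softmax policy, uniformly sampled states), not the paper's algorithm; it only shows the classification step that produces $\hat\pi_b$, whose logarithm then serves as a plug-in pseudo-reward.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, n = 4, 3, 200_000

# Hypothetical ground-truth behavior policy: softmax of random logits.
logits = rng.normal(size=(nS, nA))
pi_b = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Observed demonstrations: uniform states, actions sampled from pi_b(.|s).
states = rng.integers(nS, size=n)
u = rng.random(n)
cdf = pi_b.cumsum(-1)                     # per-state action CDF
actions = (u[:, None] > cdf[states]).sum(-1)  # inverse-CDF sampling

# "Classification" step: with discrete states, the MLE classifier is just
# the empirical action frequency in each state.
counts = np.zeros((nS, nA))
np.add.at(counts, (states, actions), 1.0)
pi_hat = counts / counts.sum(-1, keepdims=True)

pseudo_reward = np.log(pi_hat)            # plug-in log-policy pseudo-reward
print(np.max(np.abs(pi_hat - pi_b)))      # shrinks at the usual n^{-1/2} rate
```

With continuous states one would replace the frequency table by any probabilistic classifier (e.g. multinomial logistic regression) fit to (state, action) pairs; the log-probabilities play the same role.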
arXiv Detail & Related papers (2025-09-25T13:53:43Z) - Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation [28.63391989014238]
Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. We propose a model-based algorithm that achieves both sample and computational efficiency. We show that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}\, N^{-1/2})$ using $N$ measurements.
arXiv Detail & Related papers (2025-05-20T18:37:51Z) - Semiparametric Double Reinforcement Learning with Applications to Long-Term Causal Inference [33.14076284663493]
Long-term causal effects must be estimated from short-term data. MDPs provide a natural framework for capturing such long-term dynamics. Nonparametric implementations require strong intertemporal overlap assumptions. We introduce a novel plug-in estimator based on isotonic Bellman calibration.
arXiv Detail & Related papers (2025-01-12T20:35:28Z) - Learning Controlled Stochastic Differential Equations [61.82896036131116]
This work proposes a novel method for estimating both drift and diffusion coefficients of continuous, multidimensional, nonlinear controlled differential equations with non-uniform diffusion.
We provide strong theoretical guarantees, including finite-sample bounds for $L^2$, $L^\infty$, and risk metrics, with learning rates adaptive to coefficients' regularity.
Our method is available as an open-source Python library.
arXiv Detail & Related papers (2024-11-04T11:09:58Z) - Statistical Inference for Temporal Difference Learning with Linear Function Approximation [55.80276145563105]
We investigate the statistical properties of Temporal Difference learning with Polyak-Ruppert averaging. We make three theoretical contributions that improve upon the current state-of-the-art results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief [3.0036519884678894]
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model.
In this work, we maintain a belief distribution over dynamics, and evaluate/optimize policy through biased sampling from the belief.
We show that the biased sampling naturally induces an updated dynamics belief with policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
arXiv Detail & Related papers (2022-10-13T03:14:36Z) - MACE: An Efficient Model-Agnostic Framework for Counterfactual Explanation [132.77005365032468]
We propose a novel framework for Model-Agnostic Counterfactual Explanation (MACE).
In our MACE approach, we propose a novel RL-based method for finding good counterfactual examples and a gradient-less descent method for improving proximity.
Experiments on public datasets validate the effectiveness of MACE, showing better validity, sparsity, and proximity.
arXiv Detail & Related papers (2022-05-31T04:57:06Z) - Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that the robust value function is more robust than a deep reinforcement learning algorithm and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z) - Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z) - Double/Debiased Machine Learning for Dynamic Treatment Effects via g-Estimation [25.610534178373065]
We consider the estimation of treatment effects in settings when multiple treatments are assigned over time.
We propose an extension of the double/debiased machine learning framework to estimate the dynamic effects of treatments.
arXiv Detail & Related papers (2020-02-17T22:32:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.