Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees
- URL: http://arxiv.org/abs/2210.01808v1
- Date: Tue, 4 Oct 2022 17:13:45 GMT
- Title: Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees
- Authors: Siliang Zeng, Chenliang Li, Alfredo Garcia, Mingyi Hong
- Abstract summary: Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
- Score: 56.848265937921354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning (IRL) aims to recover the reward function and
the associated optimal policy that best fits observed sequences of states and
actions implemented by an expert. Many algorithms for IRL have an inherently
nested structure: the inner loop finds the optimal policy given parametrized
rewards while the outer loop updates the estimates towards optimizing a measure
of fit. For high dimensional environments such nested-loop structure entails a
significant computational burden. To reduce the computational burden of a
nested loop, novel methods such as SQIL [1] and IQ-Learn [2] emphasize policy
estimation at the expense of reward estimation accuracy. However, without
accurate estimated rewards, it is not possible to do counterfactual analysis
such as predicting the optimal policy under different environment dynamics
and/or learning new tasks. In this paper we develop a novel single-loop
algorithm for IRL that does not compromise reward estimation accuracy. In the
proposed algorithm, each policy improvement step is followed by a stochastic
gradient step for likelihood maximization. We show that the proposed algorithm
provably converges to a stationary solution with a finite-time guarantee. If
the reward is parameterized linearly, we show the identified solution
corresponds to the solution of the maximum entropy IRL problem. Finally, by
using robotics control problems in MuJoCo and their transfer settings, we show
that the proposed algorithm achieves superior performance compared with other
IRL and imitation learning benchmarks.
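To make the alternation concrete, here is a rough illustrative sketch (not the authors' exact implementation) of one iteration of such a single-loop scheme, assuming a linear reward $r_\theta(s,a) = \theta^\top \phi(s,a)$; soft_policy_improvement and sample_trajectory are hypothetical helpers standing in for a soft policy update and trajectory collection:

    import numpy as np

    def single_loop_ml_irl_step(theta, policy, expert_traj, env, phi, lr=1e-2):
        # One illustrative iteration of a single-loop ML-IRL scheme with a
        # linear reward r_theta(s, a) = theta @ phi(s, a).
        # soft_policy_improvement and sample_trajectory are hypothetical helpers.

        # 1) Policy improvement: a single soft (entropy-regularized) policy
        #    update under the current reward estimate, not an inner loop run
        #    to optimality.
        reward_fn = lambda s, a: theta @ phi(s, a)
        policy = soft_policy_improvement(policy, reward_fn, env)

        # 2) Reward update: one stochastic gradient step on the expert
        #    log-likelihood. With a linear reward, this gradient is the expert
        #    feature expectation minus the feature expectation under the
        #    current policy (as in maximum entropy IRL).
        agent_traj = sample_trajectory(policy, env)
        expert_feat = np.mean([phi(s, a) for s, a in expert_traj], axis=0)
        agent_feat = np.mean([phi(s, a) for s, a in agent_traj], axis=0)
        theta = theta + lr * (expert_feat - agent_feat)

        return theta, policy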
Related papers
- Structural Estimation of Markov Decision Processes in High-Dimensional
State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified, while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces.
arXiv Detail & Related papers (2022-10-04T00:11:38Z) - Continuous-Time Fitted Value Iteration for Robust Policies [93.25997466553929]
Solving the Hamilton-Jacobi-Bellman equation is important in many domains including control, robotics and economics.
We propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI).
These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems.
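To illustrate why these structural assumptions help (a standard argument, not a claim about the cited paper's exact derivation): with control-affine dynamics $\dot{x} = a(x) + B(x)\,u$ and a separable reward $r(x,u) = q(x) - \tfrac{1}{2}u^\top R\,u$ with $R \succ 0$, maximizing the Hamiltonian over $u$ yields the greedy action in closed form,
$$u^*(x) = R^{-1} B(x)^\top \nabla_x V(x),$$
so each value-iteration sweep avoids a numerical inner optimization over actions.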
arXiv Detail & Related papers (2021-10-05T11:33:37Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best-known complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
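For context, a generic Bregman (mirror-descent style) policy-gradient step, not necessarily the paper's exact update, with estimated gradient $g_t$, step size $\eta$, and Bregman divergence $D_\psi$ is
$$\theta_{t+1} = \arg\min_{\theta} \; \langle -g_t, \theta \rangle + \tfrac{1}{\eta} D_\psi(\theta, \theta_t),$$
which reduces to ordinary stochastic gradient ascent when $\psi(\theta) = \tfrac{1}{2}\|\theta\|_2^2$.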
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - An Efficient Algorithm for Deep Stochastic Contextual Bandits [10.298368632706817]
In contextual bandit problems, an agent selects an action based on certain observed context to maximize the reward over iterations.
Recently there have been a few studies that use a deep neural network (DNN), trained by a gradient-based method, to predict the expected reward of an action.
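As a minimal illustration of that setup (a generic sketch under assumed interfaces, not the paper's algorithm): a small network maps a context to one predicted reward per action, an $\epsilon$-greedy rule picks the action, and the observed reward drives a gradient step; env_reward below is a hypothetical callable.

    import torch

    def bandit_step(net, opt, context, n_actions, env_reward, epsilon=0.1):
        # net: DNN mapping a context vector to a vector of predicted rewards
        #      (one entry per action); opt: a torch optimizer over net's
        #      parameters; env_reward: hypothetical callable
        #      (context, action) -> observed scalar reward.
        with torch.no_grad():
            pred = net(context)                           # predicted reward per action
        if torch.rand(1).item() < epsilon:
            action = int(torch.randint(n_actions, (1,)))  # explore
        else:
            action = int(pred.argmax())                   # exploit
        r = env_reward(context, action)
        # Gradient step: regress the chosen action's prediction onto the observed reward.
        loss = (net(context)[action] - r) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        return action, float(r)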
arXiv Detail & Related papers (2021-04-12T16:34:43Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Langevin Dynamics for Adaptive Inverse Reinforcement Learning of
Stochastic Gradient Algorithms [21.796874356469644]
Inverse reinforcement learning (IRL) aims to estimate the reward function of optimizing agents by observing their response.
We present a generalized Langevin dynamics to estimate the reward function $R(\theta)$.
The proposed IRL algorithms use kernel-based passive learning schemes and generate samples from the distribution proportional to $\exp(R(\theta))$.
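For reference, the basic (unadjusted) Langevin recursion that draws approximate samples from a density proportional to $\exp(R(\theta))$ is
$$\theta_{k+1} = \theta_k + \tfrac{\eta}{2}\, \nabla_\theta R(\theta_k) + \sqrt{\eta}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$
where $\eta$ is a step size.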
arXiv Detail & Related papers (2020-06-20T23:12:11Z) - A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning [32.91450388566405]
We develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA).
We prove that both algorithms can achieve the best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-4}\right)$.
We evaluate the performance of our algorithm on several well-known examples in reinforcement learning.
arXiv Detail & Related papers (2020-03-01T07:45:51Z)