LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning
- URL: http://arxiv.org/abs/2303.00599v1
- Date: Wed, 1 Mar 2023 15:46:12 GMT
- Title: LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning
- Authors: Firas Al-Hafez, Davide Tateo, Oleg Arenz, Guoping Zhao, Jan Peters
- Abstract summary: Previous works show that a squared norm regularization on the implicit reward function is effective but do not provide a theoretical analysis of the resulting properties of the algorithms.
We show that our method, Least Squares Inverse Q-Learning, outperforms state-of-the-art algorithms, particularly in environments with absorbing states.
- Score: 30.4251858001151
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent methods for imitation learning directly learn a $Q$-function using an
implicit reward formulation rather than an explicit reward function. However,
these methods generally require implicit reward regularization to improve
stability and often mistreat absorbing states. Previous works show that a
squared norm regularization on the implicit reward function is effective, but
do not provide a theoretical analysis of the resulting properties of the
algorithms. In this work, we show that using this regularizer under a mixture
distribution of the policy and the expert provides a particularly illuminating
perspective: the original objective can be understood as squared Bellman error
minimization, and the corresponding optimization problem minimizes a bounded
$\chi^2$-Divergence between the expert and the mixture distribution. This
perspective allows us to address instabilities and properly treat absorbing
states. We show that our method, Least Squares Inverse Q-Learning (LS-IQ),
outperforms state-of-the-art algorithms, particularly in environments with
absorbing states. Finally, we propose to use an inverse dynamics model to learn
from observations only. Using this approach, we retain performance in settings
where no expert actions are available.
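As a rough sketch of the structure the abstract refers to (generic inverse soft-Q notation; the paper's exact constants, weighting, and treatment of absorbing states may differ): the implicit reward is the Bellman-style residual of the learned $Q$-function,
$$ r_Q(s,a) \;=\; Q(s,a) \;-\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ V^{\pi}(s') \big], $$
and the squared-norm regularizer is applied under the mixture $\rho_{\mathrm{mix}} = \tfrac{1}{2}(\rho_E + \rho_\pi)$ of the expert and policy state-action distributions,
$$ \psi(r_Q) \;=\; \alpha\, \mathbb{E}_{(s,a) \sim \rho_{\mathrm{mix}}}\big[ r_Q(s,a)^2 \big]. $$
Because $r_Q$ is a Bellman residual, penalizing its square behaves like squared Bellman error minimization, and in the divergence view this squared regularizer corresponds to a $\chi^2$-divergence between the expert distribution and the mixture, which is bounded since the density ratio $\rho_E / \rho_{\mathrm{mix}}$ never exceeds $2$.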
Related papers
- Iterative Reweighted Least Squares Networks With Convergence Guarantees for Solving Inverse Imaging Problems [12.487990897680422]
We present a novel optimization strategy for image reconstruction tasks under analysis-based image regularization.
We parameterize such regularizers using potential functions that correspond to weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms.
We show that thanks to the convergence guarantees of our proposed minimization strategy, such optimization can be successfully performed with a memory-efficient implicit back-propagation scheme.
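As general background on the technique in the title (a textbook IRLS step with smoothing parameter $\epsilon > 0$, not necessarily the exact update used in that paper): a weighted $\ell_p^p$ penalty with $0 < p \le 2$ can be minimized by alternating a reweighting step with a weighted least-squares solve,
$$ w_i^{(k)} = \big( |x_i^{(k)}|^2 + \epsilon \big)^{\frac{p}{2} - 1}, \qquad x^{(k+1)} = \arg\min_{x} \; \| y - A x \|_2^2 + \lambda \sum_i w_i^{(k)} x_i^2, $$
so each iteration reduces to an ordinary quadratic problem; the Schatten-matrix case applies the same reweighting to singular values.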
arXiv Detail & Related papers (2023-08-10T17:59:46Z)
- Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z)
- Robust Imitation via Mirror Descent Inverse Reinforcement Learning [18.941048578572577]
This paper proposes to predict a sequence of reward functions, which are iterative solutions for a constrained convex problem.
We prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence.
Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.
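For reference, the generic mirror descent update underlying the robustness claim (written for an abstract objective $f$ over a convex set $\mathcal{X}$ with mirror map $\psi$; the paper's reward-sequence formulation adds its own constraints) is
$$ x_{t+1} = \arg\min_{x \in \mathcal{X}} \; \eta_t \, \langle \nabla f(x_t),\, x \rangle + D_\psi(x, x_t), \qquad D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle, $$
where $D_\psi$ is the Bregman divergence mentioned in the summary above.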
arXiv Detail & Related papers (2022-10-20T12:25:21Z)
- Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
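For context, the elliptical bonus being approximated is the standard one from linear exploration methods (generic notation, not that paper's implementation): with features $\phi(s,a)$,
$$ b_t(s,a) = \beta \sqrt{ \phi(s,a)^{\top} \Sigma_t^{-1} \phi(s,a) }, \qquad \Sigma_t = \lambda I + \sum_{i < t} \phi(s_i, a_i)\, \phi(s_i, a_i)^{\top}, $$
which requires maintaining and inverting a covariance of the learned features; the anti-concentrated confidence bounds provide a cheaper approximation of this quantity.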
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The goal of inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit enhanced performance (and similar tractability) compared to off-the-shelf ones in multiple OpenAI gym environments.
arXiv Detail & Related papers (2021-05-25T10:30:45Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
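For contrast, the squared Bellman error mentioned above takes the generic policy-evaluation form
$$ \mathcal{L}_{\mathrm{SBE}}(Q) = \mathbb{E}_{(s,a,r,s'),\, a' \sim \pi(\cdot \mid s')} \Big[ \big( r + \gamma\, Q(s', a') - Q(s,a) \big)^2 \Big], $$
which is known to be statistically problematic under stochastic transitions (the double-sampling issue); the convex loss proposed in that paper, derived from the regularized linear-programming formulation, is offered as a sound alternative.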
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how value iteration under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation learning from observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
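As background, the variational representation of an $f$-divergence that this family of adversarial imitation methods builds on is
$$ D_f\big( \rho_E \,\|\, \rho_\pi \big) = \sup_{T} \; \mathbb{E}_{(s,a) \sim \rho_E}\big[ T(s,a) \big] - \mathbb{E}_{(s,a) \sim \rho_\pi}\big[ f^{*}\big( T(s,a) \big) \big], $$
where $f^{*}$ is the convex conjugate of $f$ and $T$ plays the role of a discriminator; the reparameterization mentioned above targets the optimization instabilities that arise when such objectives are trained adversarially.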
arXiv Detail & Related papers (2020-06-18T19:04:09Z)