Inverse Reinforcement Learning with the Average Reward Criterion
- URL: http://arxiv.org/abs/2305.14608v1
- Date: Wed, 24 May 2023 01:12:08 GMT
- Title: Inverse Reinforcement Learning with the Average Reward Criterion
- Authors: Feiyang Wu, Jingyang Ke, Anqi Wu
- Abstract summary: We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion.
The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent.
- Score: 3.719493310637464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of Inverse Reinforcement Learning (IRL) with an
average-reward criterion. The goal is to recover an unknown policy and a reward
function when the agent only has samples of states and actions from an
experienced agent. Previous IRL methods assume that the expert is trained in a
discounted environment, and the discount factor is known. This work alleviates
this assumption by proposing an average-reward framework with efficient
learning algorithms. We develop novel stochastic first-order methods to solve
the IRL problem under the average-reward setting, which requires solving an
Average-reward Markov Decision Process (AMDP) as a subproblem. To solve the
subproblem, we develop a Stochastic Policy Mirror Descent (SPMD) method under
general state and action spaces that needs $\mathcal{O}(1/\varepsilon)$ steps
of gradient computation. Equipped with SPMD, we propose the Inverse Policy
Mirror Descent (IPMD) method for solving the IRL problem with a
$\mathcal{O}(1/\varepsilon^2)$ complexity. To the best of our knowledge, the
aforementioned complexity results are new in IRL. Finally, we corroborate our
analysis with numerical experiments using the MuJoCo benchmark and additional
control tasks.
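To make the nested structure described in the abstract concrete, here is a minimal, hypothetical Python sketch: an outer loop that updates a linear reward estimate and an inner loop that approximately solves the resulting average-reward MDP with a KL-proximal (mirror descent) policy update. The function names, the `estimate_diff_q` helper (assumed to return an estimated differential action-value table and the learner's state-action visitation frequencies), and the feature-expectation-matching reward step are illustrative assumptions, not the paper's exact SPMD/IPMD algorithms.

```python
import numpy as np

def pmd_policy_update(policy, q_hat, eta):
    """One tabular policy mirror descent step with a KL Bregman term.

    With the KL divergence as the prox term, the update has the closed form
    pi_new(a|s) proportional to pi(a|s) * exp(eta * Q_hat(s, a)).
    """
    logits = np.log(policy + 1e-12) + eta * q_hat
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

def ipmd_style_irl(features, expert_sa_freq, estimate_diff_q,
                   outer_iters=50, inner_iters=20, eta=0.1, alpha=0.05):
    """Hypothetical IRL loop in the spirit of the abstract (not the authors' code).

    features:        array of shape (S, A, d), linear reward features phi(s, a)
    expert_sa_freq:  array of shape (S, A), expert state-action visitation frequencies
    estimate_diff_q: assumed helper returning (differential Q estimate, learner
                     state-action frequencies) for a given policy and reward
    """
    n_states, n_actions, d = features.shape
    theta = np.zeros(d)                               # linear reward weights
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(outer_iters):
        reward = features @ theta                     # r(s, a) = <phi(s, a), theta>
        for _ in range(inner_iters):                  # forward (average-reward) subproblem
            q_hat, learner_sa_freq = estimate_diff_q(policy, reward)
            policy = pmd_policy_update(policy, q_hat, eta)
        # Reward step: move learner feature expectations toward the expert's.
        grad = ((expert_sa_freq - learner_sa_freq)[..., None] * features).sum(axis=(0, 1))
        theta += alpha * grad
    return policy, theta
```

With the KL prox term the inner update reduces to an exponentiated re-weighting of the current policy, which is why mirror descent suits simplex-constrained policies; under the average-reward criterion, `q_hat` would be the differential (bias) action-value function rather than a discounted one.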
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
Inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal.
This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z) - Improved Sample Complexity Bounds for Distributionally Robust
Reinforcement Learning [3.222802562733787]
We consider the problem of learning a control policy that is robust against the parameter mismatches between the training environment and testing environment.
We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences.
We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^5)$ sample complexity, which is uniformly better than the existing results.
arXiv Detail & Related papers (2023-03-05T21:47:08Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which causes two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
arXiv Detail & Related papers (2021-10-21T07:08:30Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
Our algorithm learns an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the number of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)