Online Apprenticeship Learning
- URL: http://arxiv.org/abs/2102.06924v1
- Date: Sat, 13 Feb 2021 12:57:51 GMT
- Title: Online Apprenticeship Learning
- Authors: Lior Shani, Tom Zahavy and Shie Mannor
- Abstract summary: In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
- Score: 58.45089581278177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP)
without access to the cost function. Instead, we observe trajectories sampled
by an expert that acts according to some policy. The goal is to find a policy
that matches the expert's performance on some predefined set of cost functions.
We introduce an online variant of AL (Online Apprenticeship Learning; OAL),
where the agent is expected to perform comparably to the expert while
interacting with the environment. We show that the OAL problem can be
effectively solved by combining two mirror descent based no-regret algorithms:
one for policy optimization and another for learning the worst case cost. To
this end, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$
is the number of interactions with the MDP, and an additional linear error term
that depends on the number of expert trajectories available. Importantly, our
algorithm avoids the need to solve an MDP at each iteration, making it more
practical compared to prior AL methods. Finally, we implement a deep variant of
our algorithm, which shares some similarities with GAIL \cite{ho2016generative},
but where the discriminator is replaced with the costs learned by the OAL
problem. Our simulations demonstrate our theoretically grounded approach
outperforms the baselines.
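The abstract's core recipe, alternating a mirror-descent policy player against a mirror-descent (no-regret) cost player, can be illustrated with a short sketch. The code below is a minimal, hypothetical sketch rather than the authors' implementation: it assumes a known tabular finite-horizon MDP, a linear cost class c_w(s, a) = <w, phi(s, a)> with w on the simplex, and exact policy evaluation, whereas the paper's algorithm runs online and explicitly avoids solving the MDP at each iteration. All names (`run_oal_sketch`, `eta_pi`, `eta_w`) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of alternating two mirror-descent
# no-regret updates: a cost player that seeks the worst-case linear cost and
# a policy player that descends against the current cost.
import numpy as np

def softmax_rows(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def feature_expectations(policy, P, phi, H, s0):
    """H-step expected feature counts of `policy` started from state s0."""
    nS, nA, d = phi.shape
    mu, dist = np.zeros(d), np.zeros(nS)
    dist[s0] = 1.0
    for _ in range(H):
        occ = dist[:, None] * policy              # occupancy over (s, a)
        mu += np.einsum('sa,sad->d', occ, phi)
        dist = np.einsum('sa,sat->t', occ, P)     # next-state distribution
    return mu

def policy_q_values(policy, P, cost, H):
    """Q-values of a stationary policy over an H-step horizon (backward induction)."""
    nS, nA = cost.shape
    V, Q = np.zeros(nS), np.zeros((nS, nA))
    for _ in range(H):
        Q = cost + np.einsum('sat,t->sa', P, V)
        V = np.einsum('sa,sa->s', policy, Q)
    return Q

def run_oal_sketch(P, phi, mu_expert, H, s0, K=500, eta_pi=0.1, eta_w=0.1):
    """Alternate a cost player (exponentiated-gradient ascent on the simplex)
    and a policy player (KL mirror descent on softmax logits) for K rounds."""
    nS, nA, d = phi.shape
    theta = np.zeros((nS, nA))     # policy logits
    w = np.ones(d) / d             # cost weights on the simplex
    for _ in range(K):
        policy = softmax_rows(theta)
        # Cost player: shift weight toward features where the agent
        # accumulates more than the expert (worst-case cost).
        mu_agent = feature_expectations(policy, P, phi, H, s0)
        w *= np.exp(eta_w * (mu_agent - mu_expert))
        w /= w.sum()
        # Policy player: mirror-descent step against the current cost,
        # i.e. pi <- pi * exp(-eta_pi * Q_cost), renormalized by the softmax.
        cost = phi @ w                              # shape (nS, nA)
        theta -= eta_pi * policy_q_values(policy, P, cost, H)
    return softmax_rows(theta), w
```

In the paper's deep variant, the learned cost plays the role of GAIL's discriminator, and both players are trained from sampled trajectories rather than the exact evaluation used in this sketch.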
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z)
- Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach [25.77911741149966]
We show that if the learning agent models the behavioral policy used by the expert, it can do substantially better in terms of minimizing cumulative regret.
We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
arXiv Detail & Related papers (2023-10-17T19:01:08Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation [41.460894569204065]
We study lifelong reinforcement learning (RL) in a regret setting with linear contextual Markov decision processes (MDPs).
We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks.
arXiv Detail & Related papers (2022-06-01T06:53:28Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Stochastic Gradient Descent with Dependent Data for Offline Reinforcement Learning [4.421561004829125]
Offline learning is useful for dealing with the exploration-exploitation trade-off and enables data reuse in many applications.
In this work, we study two offline learning tasks: policy evaluation and policy learning.
arXiv Detail & Related papers (2022-02-06T20:54:36Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
- Offline Inverse Reinforcement Learning [24.316047317028147]
The goal of offline RL is to learn optimal policies when a fixed dataset of exploratory demonstrations is available.
Inspired by the success of IRL techniques in achieving state-of-the-art imitation performance in online settings, we exploit GAN-based data augmentation procedures to construct the first offline IRL algorithm.
arXiv Detail & Related papers (2021-06-09T13:44:06Z)
- Online Model Selection for Reinforcement Learning with Function Approximation [50.008542459050155]
We present a meta-algorithm that adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret.
We also show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds.
arXiv Detail & Related papers (2020-11-19T10:00:54Z)
- Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.