Provably Efficient Lifelong Reinforcement Learning with Linear Function
- Date: Wed, 1 Jun 2022 06:53:28 GMT
- Title: Provably Efficient Lifelong Reinforcement Learning with Linear Function
- Authors: Sanae Amani, Lin F. Yang, Ching-An Cheng
- Abstract summary: We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision processes (MDPs).
We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks.
- Score: 41.460894569204065
- Abstract: We study lifelong reinforcement learning (RL) in a regret minimization
setting of linear contextual Markov decision process (MDP), where the agent
needs to learn a multi-task policy while solving a streaming sequence of tasks.
We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that
provably achieves sublinear regret for any sequence of tasks, which may be
adaptively chosen based on the agent's past behaviors. Remarkably, our
algorithm uses only a sublinear number of planning calls, which means that the
agent eventually learns a policy that is near optimal for multiple tasks (seen
or unseen) without the need for deliberate planning. A key to this property is a
new structural assumption that enables computation sharing across tasks during
exploration. Specifically, for $K$ task episodes of horizon $H$, our algorithm
has a regret bound $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ based on
$\mathcal{O}(dH\log(K))$ number of planning calls, where $d$ and $d^\prime$ are
the feature dimensions of the dynamics and rewards, respectively. This
theoretical guarantee implies that our algorithm can enable a lifelong learning
agent to accumulate experiences and learn to rapidly solve new tasks.
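
To make the scaling in the abstract concrete, the following minimal sketch evaluates the leading terms of the stated regret bound and planning-call count for a few values of $K$. It is not the UCBlvd algorithm: the function names and the example values of $d$, $d^\prime$, and $H$ are illustrative assumptions, and all constants and the logarithmic factors hidden by $\tilde{\mathcal{O}}$ are dropped.

```python
import math


def regret_bound(d, d_prime, H, K):
    """Leading term sqrt((d^3 + d' d) H^4 K); constants and log factors dropped."""
    return math.sqrt((d**3 + d_prime * d) * H**4 * K)


def planning_calls(d, H, K):
    """Leading term d * H * log(K); the constant factor is dropped."""
    return d * H * math.log(K)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    d, d_prime, H = 10, 5, 20
    for K in (10**3, 10**5, 10**7):
        r = regret_bound(d, d_prime, H, K)
        p = planning_calls(d, H, K)
        # Both per-episode averages shrink as K grows, i.e. both are sublinear in K.
        print(f"K={K:>9}  regret/K={r / K:12.3f}  planning/K={p / K:.6f}")
```

Under these assumptions, both the regret per episode and the planning calls per episode decrease as $K$ grows, which is what "sublinear" means for each quantity.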
Related papers
- Algorithm Design for Continual Learning in IoT Networks [16.35495567193046]
Continual learning (CL) is a new online learning technique over sequentially generated streaming data from different tasks.
In practical IoT networks, an autonomous vehicle that samples data and learns different tasks can choose its route and alter the order of the task pattern.
arXiv Detail & Related papers (2024-12-22T02:36:09Z)
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Multi-task Representation Learning with Stochastic Linear Bandits [29.8208189270894]
We study the problem of transfer-learning in the setting of linear bandit tasks.
We consider that a low dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting.
arXiv Detail & Related papers (2022-02-21T09:26:34Z)
- Provable Lifelong Learning of Representations [21.440845049501778]
We propose a provable lifelong learning algorithm that maintains and refines the internal feature representation.
We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation.
arXiv Detail & Related papers (2021-10-27T00:41:23Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)