Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement
Learning in Discounted Linear MDPs
- URL: http://arxiv.org/abs/2310.11515v1
- Date: Tue, 17 Oct 2023 18:27:27 GMT
- Title: Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement
Learning in Discounted Linear MDPs
- Authors: Yu-Heng Hung, Ping-Chun Hsieh, Akshay Mete, P. R. Kumar
- Abstract summary: We propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE).
VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step.
In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct.
- Score: 16.006893624836554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider infinite-horizon linear Markov Decision Processes (MDPs),
where the transition probabilities of the dynamic model can be linearly
parameterized with the help of a predefined low-dimensional feature mapping.
While the existing regression-based approaches have been theoretically shown to
achieve nearly-optimal regret, they are computationally rather inefficient due
to the need for a large number of optimization runs in each time step,
especially when the state and action spaces are large. To address this issue,
we propose to solve linear MDPs through the lens of Value-Biased Maximum
Likelihood Estimation (VBMLE), which is a classic model-based exploration
principle in the adaptive control literature for resolving the well-known
closed-loop identification problem of Maximum Likelihood Estimation. We
formally show that (i) VBMLE enjoys $\widetilde{O}(d\sqrt{T})$ regret, where
$T$ is the time horizon and $d$ is the dimension of the model parameter, and
(ii) VBMLE is computationally more efficient as it only requires solving one
optimization problem in each time step. In our regret analysis, we offer a
generic convergence result of MLE in linear MDPs through a novel
supermartingale construct and uncover an interesting connection between linear
MDPs and online learning, which could be of independent interest. Finally, the
simulation results show that VBMLE significantly outperforms the benchmark
method in terms of both empirical regret and computation time.
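To make the mechanism concrete, here is a minimal, hedged sketch of the value-biased estimation loop: at every step a single optimization problem is solved, maximizing the log-likelihood of the observed transitions plus an increasing bias term that favors models with a larger optimal value. The environment, the softmax transition parameterization (a stand-in for the paper's exact linear form), the bias schedule $\alpha(t)=\sqrt{t}$, and the random-search solver are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 5, 3, 4, 0.9

phi = rng.normal(size=(S, A, d))          # feature map phi(s, a)
true_theta = rng.normal(size=(d, S))      # unknown environment parameter
R = rng.uniform(size=(S, A))              # rewards, assumed known here

def transition_probs(theta):
    """P_theta(s'|s,a): softmax over next states (illustrative stand-in
    for the paper's linear parameterization phi(s,a)^T mu(s'))."""
    logits = phi @ theta                  # shape (S, A, S)
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def optimal_value(theta, iters=100):
    """Discounted optimal value V*_theta via value iteration."""
    P = transition_probs(theta)
    V = np.zeros(S)
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return V

def vbmle_objective(theta, data, alpha):
    """Transition log-likelihood biased by alpha times the optimal value
    (the mean over states is an illustrative scalarization)."""
    P = transition_probs(theta)
    ll = sum(np.log(P[s, a, s2]) for (s, a, s2) in data)
    return ll + alpha * optimal_value(theta).mean()

data, s, theta_hat = [], 0, np.zeros((d, S))
for t in range(1, 101):
    alpha = np.sqrt(t)                    # increasing bias weight (assumed)
    # the single optimization problem of this time step, solved here by
    # crude local random search purely for illustration
    best = vbmle_objective(theta_hat, data, alpha)
    for _ in range(10):
        cand = theta_hat + 0.1 * rng.normal(size=theta_hat.shape)
        val = vbmle_objective(cand, data, alpha)
        if val > best:
            best, theta_hat = val, cand
    # act greedily with respect to the value-biased model estimate
    Q = R + gamma * transition_probs(theta_hat) @ optimal_value(theta_hat)
    a = int(Q[s].argmax())
    s2 = rng.choice(S, p=transition_probs(true_theta)[s, a])
    data.append((s, a, int(s2)))
    s = int(s2)
```

The point the sketch preserves is that exploration comes entirely from the value bias: no bonus term or posterior sampling is added, and exactly one optimization problem is posed per time step.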
Related papers
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimization problems are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks, named Two-Stage Deep Decision Rules (TS-DDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- Exact and general decoupled solutions of the LMC Multitask Gaussian Process model [28.32223907511862]
The Linear Model of Co-regionalization (LMC) is a very general multitask Gaussian process model for regression or classification.
Recent work has shown that under some conditions the latent processes of the model can be decoupled, leading to a complexity that is only linear in the number of said processes.
We here extend these results, showing under the most general assumptions that the only condition necessary for an efficient exact computation of the LMC is a mild hypothesis on the noise model (a toy construction of the LMC kernel is sketched after this entry).
arXiv Detail & Related papers (2023-10-18T15:16:24Z)
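As a point of reference for the entry above, the following is a hedged sketch of the LMC kernel itself, i.e. the sum-of-Kronecker-products structure whose latent processes the cited decoupling result exploits; the RBF kernels, problem sizes, and noise level are illustrative assumptions, and this is the naive (non-decoupled) computation.

```python
import numpy as np

def rbf(X, Y, length=1.0):
    """Squared-exponential kernel k_q(x, x')."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
n, tasks, Q = 50, 3, 2                    # inputs, tasks, latent processes

X = rng.uniform(size=(n, 1))
K = np.zeros((n * tasks, n * tasks))
for q in range(Q):
    a = rng.normal(size=(tasks, 1))
    B_q = a @ a.T + 0.1 * np.eye(tasks)   # PSD coregionalization matrix
    # LMC kernel: K((x,i),(x',j)) = sum_q B_q[i,j] * k_q(x,x')
    K += np.kron(B_q, rbf(X, X, length=0.5 + q))

# Naive exact inference costs O((n*tasks)^3); the decoupling result above
# reduces the cost to linear in Q under a mild noise-model hypothesis.
y = rng.normal(size=n * tasks)
alpha = np.linalg.solve(K + 1e-2 * np.eye(n * tasks), y)
```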
- Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice [79.48432795639403]
Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL)- and entropy-regularized reinforcement learning (RL).
We study MDVI with linear function approximation through the sample complexity required to identify an $\varepsilon$-optimal policy.
We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs (a tabular sketch of the underlying mirror-descent update follows this entry).
arXiv Detail & Related papers (2023-05-22T16:13:05Z)
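For context on the entry above, below is a minimal tabular sketch of the KL-regularized (mirror-descent) policy update that MDVI abstracts; the cited paper's actual contribution, variance-weighted least-squares regression with linear function approximation, is not reproduced here, and the step size and problem sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eta = 4, 3, 0.9, 1.0            # eta: mirror-descent step size
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = P(. | s, a)
R = rng.uniform(size=(S, A))

pi = np.full((S, A), 1.0 / A)                # uniform initial policy
for k in range(100):
    # policy evaluation: Q^pi by iterating the Bellman operator
    Q = np.zeros((S, A))
    for _ in range(200):
        V = (pi * Q).sum(axis=1)
        Q = R + gamma * P @ V
    # KL-regularized greedy step: pi_{k+1}(a|s) is proportional to
    # pi_k(a|s) * exp(eta * Q(s,a)), computed stably in log space
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)
```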
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\widetilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are made through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z)
- Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning [77.34726150561087]
We propose a novel regularization scheme for linear decision rules (LDR) based on AdaSO (the adaptive least absolute shrinkage and selection operator).
Experiments show that the risk of overfitting is non-negligible when the classical non-regularized LDR is used to solve MSLP.
For the LHDP problem, our analysis highlights several benefits of the proposed framework in comparison to the non-regularized benchmark (a toy regularized-LDR fit is sketched after this entry).
arXiv Detail & Related papers (2021-10-07T02:36:14Z)
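To illustrate the entry above, here is a toy regularized linear decision rule fit; a plain L1 (lasso) penalty stands in for the paper's adaptive AdaSO scheme, and the synthetic data, penalty weight, and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
xi = rng.normal(size=(n, p))              # sampled uncertainty (e.g. inflows)
# oracle decisions that truly depend on only two coordinates of xi
x_star = 2.0 * xi[:, 0] - 1.5 * xi[:, 3] + 0.1 * rng.normal(size=n)

# fit the decision rule x(xi) = w @ xi + b; the L1 penalty shrinks
# irrelevant coefficients, curbing the overfitting reported for
# non-regularized LDRs
ldr = Lasso(alpha=0.05).fit(xi, x_star)
print(ldr.coef_.round(2))                 # sparse: most entries are zero
```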
- Improved Prediction and Network Estimation Using the Monotone Single Index Multi-variate Autoregressive Model [34.529641317832024]
We develop a semi-parametric approach based on the monotone single-index multi-variate autoregressive model (SIMAM).
We provide theoretical guarantees for dependent data and an alternating projected gradient descent algorithm.
We demonstrate superior performance on both simulated data and two real data examples.
arXiv Detail & Related papers (2021-06-28T12:32:29Z)