Matrix Low-Rank Trust Region Policy Optimization
- URL: http://arxiv.org/abs/2405.17625v1
- Date: Mon, 27 May 2024 19:46:31 GMT
- Title: Matrix Low-Rank Trust Region Policy Optimization
- Authors: Sergio Rozada, Antonio G. Marques
- Abstract summary: Most methods in reinforcement learning use a Policy Gradient approach to learn a parametric policy that maps states to actions.
Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements.
This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms.
- Score: 10.196333441334895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce low rank. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while maintaining comparable aggregated rewards.
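The core idea in the abstract, collecting the stochastic policy's parameters into a matrix, enforcing low rank, and keeping a TRPO-style KL-constrained update, can be illustrated with a small sketch. The following is a minimal numpy illustration and not the authors' implementation: the logit matrix of a tabular softmax policy is factorized as U V^T, and a backtracking line search stands in for TRPO's constrained step. All sizes and helper names (`policy_probs`, `surrogate`, `trpo_like_step`) are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): a tabular softmax policy whose
# logit matrix is parameterized by a low-rank factorization Theta = U @ V.T,
# updated with a TRPO-style KL-constrained backtracking step.
import numpy as np

S, A, RANK = 50, 4, 3                      # assumed sizes: states, actions, rank
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((S, RANK))   # state factors
V = 0.1 * rng.standard_normal((A, RANK))   # action factors

def policy_probs(U, V):
    logits = U @ V.T                       # S x A low-rank logit matrix
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def surrogate(U, V, old_probs, advantages):
    # Importance-weighted policy-gradient surrogate, averaged over all (s, a).
    ratio = policy_probs(U, V) / old_probs
    return np.mean(old_probs * ratio * advantages)

def kl(old_probs, new_probs):
    return np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=1))

def trpo_like_step(U, V, advantages, step=1.0, max_kl=1e-2, backtrack=0.5):
    old_probs = policy_probs(U, V)
    # Finite-difference gradients keep the sketch dependency-free.
    def grad(f, X, eps=1e-5):
        g = np.zeros_like(X)
        for idx in np.ndindex(X.shape):
            Xp = X.copy(); Xp[idx] += eps
            Xm = X.copy(); Xm[idx] -= eps
            g[idx] = (f(Xp) - f(Xm)) / (2 * eps)
        return g
    gU = grad(lambda X: surrogate(X, V, old_probs, advantages), U)
    gV = grad(lambda X: surrogate(U, X, old_probs, advantages), V)
    # Backtracking line search: shrink the step until the KL constraint holds.
    while step > 1e-6:
        U_new, V_new = U + step * gU, V + step * gV
        if kl(old_probs, policy_probs(U_new, V_new)) <= max_kl:
            return U_new, V_new
        step *= backtrack
    return U, V

advantages = rng.standard_normal((S, A))   # placeholder advantage estimates
U, V = trpo_like_step(U, V, advantages)
```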
Related papers
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z) - Multilinear Tensor Low-Rank Approximation for Policy-Gradient Methods in Reinforcement Learning [27.868175900131313]
Reinforcement learning (RL) aims to estimate the action to take given a (time-varying) state.
This paper postulates multi-linear mappings to efficiently estimate the parameters of the RL policy.
We leverage the PARAFAC decomposition to design tensor low-rank policies.
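As a rough illustration of what a PARAFAC-based policy parameterization can look like (a sketch under assumed dimensions, not the paper's code), the logit tensor over a multi-dimensional state and a discrete action can be stored through its CP factors only:

```python
# Minimal sketch (assumed, not from the paper): a policy logit tensor over a
# two-dimensional discretized state and a discrete action, stored as a
# PARAFAC/CP decomposition so only (D1 + D2 + A) * R parameters are kept.
import numpy as np

D1, D2, A, R = 10, 12, 4, 3                 # assumed grid sizes, actions, CP rank
rng = np.random.default_rng(0)
factors = [0.1 * rng.standard_normal((d, R)) for d in (D1, D2, A)]

def logits_from_cp(factors):
    # Theta[i, j, a] = sum_r factors[0][i, r] * factors[1][j, r] * factors[2][a, r]
    return np.einsum('ir,jr,ar->ija', *factors)

def policy(factors, s1, s2):
    z = logits_from_cp(factors)[s1, s2]     # logits for one state
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

print(policy(factors, 3, 7))                # action distribution at state (3, 7)
```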
arXiv Detail & Related papers (2025-01-08T23:22:08Z) - Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation [48.92318828548911]
We present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps.
LoRa-PI learns an $\varepsilon$-optimal policy using $\widetilde{O}\!\left(\frac{S+A}{\mathrm{poly}(1-\gamma)\,\varepsilon^{2}}\right)$ samples, where $S$ (resp. $A$) denotes the number of states (resp. actions) and $\gamma$ the discount factor.
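The sketch below conveys only the general flavor of low-rank value estimation from sampled entries; it replaces the paper's leveraged entry-wise sampling with uniform sampling and a naive truncated-SVD completion, so it should not be read as the LoRa-PI algorithm itself.

```python
# Rough sketch of the general idea (not LoRa-PI): estimate a low-rank
# state-action value matrix from a subset of sampled entries via truncated
# SVD, then act greedily on the estimate.
import numpy as np

S, A, RANK = 100, 8, 2
rng = np.random.default_rng(1)
Q_true = rng.standard_normal((S, RANK)) @ rng.standard_normal((RANK, A))

# Observe a random 30% of the entries (noisy "evaluation" samples).
mask = rng.random((S, A)) < 0.3
Q_obs = np.where(mask, Q_true + 0.1 * rng.standard_normal((S, A)), 0.0)

# Naive completion: truncated SVD of the zero-filled matrix, rescaled by the
# sampling rate so the retained singular values are roughly unbiased.
U, s, Vt = np.linalg.svd(Q_obs / mask.mean(), full_matrices=False)
Q_hat = U[:, :RANK] @ np.diag(s[:RANK]) @ Vt[:RANK]

greedy_policy = Q_hat.argmax(axis=1)        # improved policy from the estimate
print("entrywise error:", np.abs(Q_hat - Q_true).mean())
```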
arXiv Detail & Related papers (2024-10-30T20:22:17Z) - Matrix Low-Rank Approximation For Policy Gradient Methods [10.196333441334895]
Estimating a policy that maps states to actions is a central problem in reinforcement learning.
In this paper, we put forth low-rank matrix-based models to efficiently estimate the parameters of Policy Gradient algorithms.
arXiv Detail & Related papers (2024-05-27T19:49:08Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs)
We propose the periodically restarted optimistic policy optimization algorithm (PROPO).
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
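A schematic of how those two mechanisms might fit together in a training loop is sketched below; the environment, evaluation, and improvement routines are left as placeholders and all hyperparameter names are assumptions, so this is not the PROPO pseudocode.

```python
# Schematic sketch of the two mechanisms described above (not PROPO itself):
# value estimates use only a sliding window of recent transitions, and the
# policy-optimization state is periodically reset to handle non-stationarity.
from collections import deque

WINDOW, RESTART_EVERY, EPISODES = 200, 50, 500   # assumed hyperparameters

def run(env_step, init_policy_state, evaluate, improve):
    window = deque(maxlen=WINDOW)                # sliding window of transitions
    policy_state = init_policy_state()
    for ep in range(EPISODES):
        if ep % RESTART_EVERY == 0:              # periodic restart of the optimizer state
            policy_state = init_policy_state()
        window.extend(env_step(policy_state))    # collect one episode of transitions
        values = evaluate(list(window))          # sliding-window policy evaluation
        policy_state = improve(policy_state, values)  # optimistic policy improvement
    return policy_state
```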
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
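For orientation, the sketch below shows plain least-squares TD policy evaluation with linear features; the robust (RMDP) modification of the Bellman operator that the paper introduces is omitted, and all names are illustrative.

```python
# Illustrative sketch of ordinary least-squares TD policy evaluation with
# linear features (the paper's robust variant for RMDPs is not shown).
import numpy as np

def lstd_evaluate(transitions, phi, gamma=0.95, reg=1e-3):
    """transitions: list of (s, r, s_next); phi(s): feature vector of s."""
    d = phi(transitions[0][0]).shape[0]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    w = np.linalg.solve(A, b)               # value estimate: V(s) ~ phi(s) @ w
    return w

phi = lambda s: np.array([1.0, s, s**2])    # assumed polynomial features
w = lstd_evaluate([(0.1, 1.0, 0.2), (0.2, 0.5, 0.3)], phi)
```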
arXiv Detail & Related papers (2020-06-20T16:26:50Z) - Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.