Convex Programs and Lyapunov Functions for Reinforcement Learning: A
Unified Perspective on the Analysis of Value-Based Methods
- URL: http://arxiv.org/abs/2202.06922v1
- Date: Mon, 14 Feb 2022 18:32:57 GMT
- Title: Convex Programs and Lyapunov Functions for Reinforcement Learning: A
Unified Perspective on the Analysis of Value-Based Methods
- Authors: Xingang Guo, Bin Hu
- Abstract summary: Value-based methods play a fundamental role in Markov decision processes (MDPs) and reinforcement learning (RL).
We present a unified control-theoretic framework for analyzing value-based methods such as value computation (VC), value iteration (VI), and temporal difference (TD) learning.
- Score: 3.9391112596932243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Value-based methods play a fundamental role in Markov decision processes
(MDPs) and reinforcement learning (RL). In this paper, we present a unified
control-theoretic framework for analyzing value-based methods such as value
computation (VC), value iteration (VI), and temporal difference (TD) learning
(with linear function approximation). Built upon an intrinsic connection
between value-based methods and dynamic systems, we can directly use existing
convex testing conditions in control theory to derive various convergence
results for the aforementioned value-based methods. These testing conditions
are convex programs in the form of either linear programming (LP) or semidefinite
programming (SDP), and can be solved to construct Lyapunov functions in a
straightforward manner. Our analysis reveals some intriguing connections
between feedback control systems and RL algorithms. It is our hope that such
connections can inspire more work at the intersection of system/control theory
and RL.
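The abstract's core recipe (view a value-based update as a dynamical system, then solve a convex program whose feasible point yields a Lyapunov function certifying convergence) can be illustrated with a small sketch. The Python code below is not taken from the paper: it uses a made-up MDP, treats value computation (policy evaluation) as the linear error dynamics e_{k+1} = gamma * P e_k, solves the discrete Lyapunov equation A^T Q A - Q = -I (a special, equality-constrained instance of the SDP feasibility condition A^T Q A - Q < 0, Q > 0 in the matrix sense), and verifies that the resulting quadratic Lyapunov function decreases along the iterates.

```python
# A minimal, self-contained sketch (hypothetical MDP, not the paper's code):
# value computation V_{k+1} = r + gamma * P @ V_k has error dynamics
# e_{k+1} = (gamma * P) @ e_k. Solving the discrete Lyapunov equation
# A^T Q A - Q = -I (a special case of the SDP condition A^T Q A - Q < 0)
# gives a quadratic Lyapunov function L(e) = e^T Q e certifying convergence.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Random row-stochastic transition matrix and reward vector (illustrative only).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

A = gamma * P                                  # spectral radius <= gamma < 1
Q = solve_discrete_lyapunov(A.T, np.eye(n))    # solves A^T Q A - Q = -I

V_star = np.linalg.solve(np.eye(n) - A, r)     # fixed point V* = r + gamma * P V*
lyap = lambda e: float(e @ Q @ e)

# Run value computation and check the Lyapunov function is monotonically decreasing.
V = np.zeros(n)
prev = lyap(V - V_star)
for _ in range(50):
    V = r + A @ V
    cur = lyap(V - V_star)
    assert cur <= prev + 1e-12, "Lyapunov function should decrease along iterates"
    prev = cur

print("||V_50 - V*|| =", np.linalg.norm(V - V_star))
```

A linear (rather than quadratic) Lyapunov function could likewise be obtained from an LP by exploiting the entrywise nonnegativity of gamma * P; the precise LP/SDP testing conditions for VC, VI, and TD learning are the ones stated in the paper, not reproduced here.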
Related papers
- Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Neural Lyapunov Differentiable Predictive Control [2.042924346801313]
We present a learning-based predictive control methodology using the differentiable programming framework with probabilistic Lyapunov-based stability guarantees.
In conjunction, our approach jointly learns a Lyapunov function that certifies the regions of state-space with stable dynamics.
arXiv Detail & Related papers (2022-05-22T03:52:27Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Extended Radial Basis Function Controller for Reinforcement Learning [3.42658286826597]
This paper proposes a hybrid reinforcement learning controller which dynamically interpolates a model-based linear controller and an arbitrary differentiable policy.
The linear controller is designed based on local linearised model knowledge, and stabilises the system in a neighbourhood about an operating point.
Learning is carried out in both model-based (PILCO) and model-free (DDPG) frameworks.
arXiv Detail & Related papers (2020-09-12T20:56:48Z) - Single-step deep reinforcement learning for open-loop control of laminar
and turbulent flows [0.0]
This research gauges the ability of deep reinforcement learning (DRL) techniques to assist the optimization and control of fluid mechanical systems.
It combines a novel, "degenerate" version of the proximal policy optimization (PPO) algorithm that trains a neural network to optimize the system only once per learning episode.
arXiv Detail & Related papers (2020-06-04T16:11:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.