Sample Efficient Model-free Reinforcement Learning from LTL
Specifications with Optimality Guarantees
- URL: http://arxiv.org/abs/2305.01381v2
- Date: Wed, 3 May 2023 12:47:09 GMT
- Title: Sample Efficient Model-free Reinforcement Learning from LTL
Specifications with Optimality Guarantees
- Authors: Daqian Shao and Marta Kwiatkowska
- Abstract summary: We present a model-free Reinforcement Learning (RL) approach that efficiently learns an optimal policy for an unknown system.
We also provide improved theoretical results on choosing the key parameters to ensure optimality.
- Score: 17.69385864791265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear Temporal Logic (LTL) is widely used to specify high-level objectives
for system policies, and it is highly desirable for autonomous systems to learn
the optimal policy with respect to such specifications. However, learning the
optimal policy from LTL specifications is not trivial. We present a model-free
Reinforcement Learning (RL) approach that efficiently learns an optimal policy
for an unknown stochastic system, modelled using Markov Decision Processes
(MDPs). We propose a novel and more general product MDP, reward structure and
discounting mechanism that, when applied in conjunction with off-the-shelf
model-free RL algorithms, efficiently learn the optimal policy that maximizes
the probability of satisfying a given LTL specification with optimality
guarantees. We also provide improved theoretical results on choosing the key
parameters in RL to ensure optimality. To directly evaluate the learned policy,
we adopt the probabilistic model checker PRISM to compute the probability that
the policy satisfies such specifications. Several experiments on various tabular
MDP environments across different LTL tasks demonstrate improved sample
efficiency and convergence to the optimal policy.
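The paper's own product MDP, reward structure and discounting mechanism are not reproduced here. The sketch below only illustrates the generic recipe the abstract refers to, under assumptions that are not from the paper: a toy slippery line-world MDP, a hand-coded two-state automaton for the reachability task "eventually reach the goal", an accepting-state reward of 1 - gamma_B with two discount factors (a common scheme in the LTL-RL literature), and placeholder hyper-parameters.

```python
import random
from collections import defaultdict

# Toy line-world MDP: states 0..4, actions move left (-1) or right (+1).
N_STATES = 5
ACTIONS = (-1, +1)
GOAL = 4

def mdp_step(s, a):
    # Slippery dynamics: the intended move succeeds with probability 0.9,
    # otherwise the agent stays put.
    return min(max(s + a, 0), N_STATES - 1) if random.random() < 0.9 else s

# Hand-coded two-state automaton for "eventually reach GOAL":
# q = 0 until GOAL is visited, then q = 1 (accepting sink).
def automaton_step(q, s):
    return 1 if (q == 1 or s == GOAL) else 0

ACCEPTING = {1}

# Tabular Q-learning on the product MDP with an accepting-state reward and
# two discount factors (all values are placeholders, not the paper's choices).
GAMMA, GAMMA_B, ALPHA, EPSILON = 0.99, 0.9, 0.1, 0.1
Q = defaultdict(float)

def greedy(s, q):
    return max(ACTIONS, key=lambda a: Q[(s, q, a)])

for episode in range(2000):
    s, q = 0, 0
    for t in range(50):
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s, q)
        s2 = mdp_step(s, a)
        q2 = automaton_step(q, s2)
        # Reaching an accepting product state earns reward 1 - GAMMA_B and is
        # discounted by GAMMA_B; every other transition earns 0 and uses GAMMA.
        r, gamma = (1.0 - GAMMA_B, GAMMA_B) if q2 in ACCEPTING else (0.0, GAMMA)
        target = r + gamma * max(Q[(s2, q2, b)] for b in ACTIONS)
        Q[(s, q, a)] += ALPHA * (target - Q[(s, q, a)])
        s, q = s2, q2

print("Greedy action in the initial product state:", greedy(0, 0))
```

In the actual approach, the automaton would be derived from the LTL formula (e.g., as a limit-deterministic Büchi automaton), the reward and discounting would follow the paper's construction, and the learned policy would then be evaluated by model checking the induced Markov chain in PRISM; those steps are omitted from this sketch.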
Related papers
- Meta-Reinforcement Learning with Universal Policy Adaptation: Provable Near-Optimality under All-task Optimum Comparator [9.900800253949512]
We develop a bilevel optimization framework for meta-RL (BO-MRL) to learn the meta-prior for task-specific policy adaptation.
We empirically validate the correctness of the derived upper bounds and demonstrate the superior effectiveness of the proposed algorithm over benchmarks.
arXiv Detail & Related papers (2024-10-13T05:17:58Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads in practice to a compromised mean-seeking approximation of that optimal solution.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL [23.957148537567146]
This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL.
For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples.
arXiv Detail & Related papers (2023-05-18T15:19:26Z)
- Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality [141.89413461337324]
Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL).
We propose a theoretical formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective.
arXiv Detail & Related papers (2022-02-14T01:31:46Z)
- Tailored neural networks for learning optimal value functions in MPC [0.0]
Learning-based predictive control is a promising alternative to optimization-based MPC.
In this paper, we provide a similar result for representing the optimal value function and the Q-function that are both known to be piecewise quadratic for linear MPC.
arXiv Detail & Related papers (2021-12-07T20:34:38Z)
- Model-Free Learning of Safe yet Effective Controllers [11.876140218511157]
We study the problem of learning safe control policies that are also effective.
We propose a model-free reinforcement learning algorithm that learns a policy that first maximizes the probability of ensuring safety.
arXiv Detail & Related papers (2021-03-26T17:05:12Z)
- Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction [5.337302350000984]
This paper presents a model-free reinforcement learning algorithm to synthesize a control policy.
The effectiveness of the RL-based control synthesis is demonstrated via simulation and experimental results.
arXiv Detail & Related papers (2020-10-14T03:49:16Z)
- Certified Reinforcement Learning with Logic Guidance [78.2286146954051]
We propose a model-free RL algorithm that enables the use of Linear Temporal Logic (LTL) to formulate a goal for unknown continuous-state/action Markov Decision Processes (MDPs).
The algorithm is guaranteed to synthesise a control policy whose traces satisfy the specification with maximal probability.
arXiv Detail & Related papers (2019-02-02T20:09:32Z)