Eventual Discounting Temporal Logic Counterfactual Experience Replay
- URL: http://arxiv.org/abs/2303.02135v1
- Date: Fri, 3 Mar 2023 18:29:47 GMT
- Title: Eventual Discounting Temporal Logic Counterfactual Experience Replay
- Authors: Cameron Voloshin, Abhinav Verma, Yisong Yue
- Abstract summary: The standard RL framework can be too myopic to find maximally satisfying policies.
We develop a new value-function based proxy, using a technique we call eventual discounting.
Second, we develop a new experience replay method for generating off-policy data.
- Score: 42.20459462725206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear temporal logic (LTL) offers a simplified way of specifying tasks for
policy optimization that may otherwise be difficult to describe with scalar
reward functions. However, the standard RL framework can be too myopic to find
maximally LTL satisfying policies. This paper makes two contributions. First,
we develop a new value-function based proxy, using a technique we call eventual
discounting, under which one can find policies that satisfy the LTL
specification with highest achievable probability. Second, we develop a new
experience replay method for generating off-policy data from on-policy rollouts
via counterfactual reasoning on different ways of satisfying the LTL
specification. Our experiments, conducted in both discrete and continuous
state-action spaces, confirm the effectiveness of our counterfactual experience
replay approach.
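To make the counterfactual experience replay idea concrete, below is a minimal, illustrative Python sketch. It assumes the LTL formula has already been compiled into an automaton with a state set `automaton_states`, a deterministic transition function `delta(q, label)`, a `labeler` mapping environment states to the atomic propositions they satisfy, and a `reward_fn` over automaton transitions; all of these names are hypothetical and this is not the authors' implementation.

```python
from collections import namedtuple

# Product-space transition: environment state paired with an automaton state.
Transition = namedtuple("Transition", "state q action reward next_state next_q done")

def counterfactual_replay(rollout, automaton_states, delta, labeler, reward_fn):
    """Expand one on-policy rollout into off-policy data by replaying each
    environment transition from every automaton state, i.e. counterfactual
    ways of making progress toward satisfying the LTL specification."""
    buffer = []
    for state, action, next_state, done in rollout:
        label = labeler(next_state)        # atomic propositions true in next_state
        for q in automaton_states:         # counterfactual automaton states
            next_q = delta(q, label)       # automaton step under the same label
            reward = reward_fn(q, next_q)  # e.g. 1.0 on an accepting transition
            buffer.append(Transition(state, q, action, reward, next_state, next_q, done))
    return buffer
```

Under these assumptions, each environment step yields one product-space transition per automaton state, all consistent with the true dynamics, and the resulting buffer can be consumed by any off-policy learner.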
Related papers
- Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning.
We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging.
We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
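As a rough illustration of casting the LDBA as part of the learning problem, the sketch below wraps an environment so the agent observes the pair (environment state, automaton state) and receives reward on reaching accepting automaton states. This is a generic product-construction sketch under assumed interfaces (`delta`, `accepting`, `labeler`), not the exploration scheme proposed in the paper.

```python
class ProductEnv:
    """Gym-style wrapper pairing environment states with LDBA states.
    Reward is emitted when the automaton reaches an accepting state."""

    def __init__(self, env, q0, delta, accepting, labeler):
        self.env = env              # underlying environment
        self.q0 = q0                # initial automaton state
        self.delta = delta          # automaton transition function
        self.accepting = accepting  # set of accepting automaton states
        self.labeler = labeler      # maps environment states to atomic propositions

    def reset(self):
        self.q = self.q0
        return self.env.reset(), self.q

    def step(self, action):
        next_state, _, done, info = self.env.step(action)
        self.q = self.delta(self.q, self.labeler(next_state))
        reward = 1.0 if self.q in self.accepting else 0.0  # sparse acceptance reward
        return (next_state, self.q), reward, done, info
```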
arXiv Detail & Related papers (2024-08-18T14:25:44Z)
- Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show that this approach generalizes the direct alignment method IPO (identity preference optimization) and the classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z)
- LTLDoG: Satisfying Temporally-Extended Symbolic Constraints for Safe Diffusion-based Planning [12.839846486863308]
In this work, we focus on generating long-horizon trajectories that adhere to novel static and temporally-extended constraints/instructions at test time.
We propose a data-driven diffusion-based framework, LTLDoG, that modifies the inference steps of the reverse process given an instruction specified using linear temporal logic.
Experiments in robot navigation and manipulation illustrate that the method is able to generate trajectories that satisfy formulae that specify obstacle avoidance and visitation sequences.
arXiv Detail & Related papers (2024-05-07T11:54:22Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
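One plausible reading of return-based rebalancing is sketched below: trajectories from the fixed dataset are resampled with probabilities that grow with their returns, which reweights the data but never extends its support. The softmax weighting, the `temperature` parameter, and the trajectory dictionary layout are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def return_based_rebalance(trajectories, temperature=1.0, rng=None):
    """Resample a fixed offline dataset with probabilities that increase with
    trajectory return. Resampling only reweights existing trajectories, so the
    support of the dataset distribution is unchanged. Assumes each trajectory
    is a dict with a "rewards" list."""
    rng = rng or np.random.default_rng()
    returns = np.array([sum(t["rewards"]) for t in trajectories], dtype=np.float64)
    weights = np.exp((returns - returns.max()) / temperature)  # softmax over returns
    probs = weights / weights.sum()
    idx = rng.choice(len(trajectories), size=len(trajectories), replace=True, p=probs)
    return [trajectories[i] for i in idx]
```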
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Learning Minimally-Violating Continuous Control for Infeasible Linear Temporal Logic Specifications [2.496282558123411]
This paper explores continuous-time control for target-driven navigation to satisfy complex high-level tasks expressed as linear temporal logic (LTL).
We propose a model-free synthesis framework using deep reinforcement learning (DRL) where the underlying dynamic system is unknown (an opaque box).
arXiv Detail & Related papers (2022-10-03T18:32:20Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
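The jump-start idea can be pictured as a roll-in/roll-out loop: a guide policy controls the first steps of each episode and the learning policy takes over afterward, with the switch point moved earlier as the learner improves. The sketch below illustrates that mechanism; the function and argument names are hypothetical and this is not the authors' code.

```python
def jump_start_rollout(env, guide_policy, explore_policy, switch_step, max_steps=1000):
    """Roll in with the guide policy for `switch_step` steps, then hand control
    to the learning policy. Shrinking `switch_step` over training forms a
    curriculum of starting states for the learner."""
    obs = env.reset()
    trajectory = []
    for t in range(max_steps):
        policy = guide_policy if t < switch_step else explore_policy
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return trajectory
```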
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Replay For Safety [51.11953997546418]
In experience replay, past transitions are stored in a memory buffer and re-used during learning.
We show that using an appropriate biased sampling scheme can allow us to achieve a safe policy.
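One way such a biased sampling scheme could look in code is sketched below: transitions are drawn from the replay buffer with weights supplied by a user-defined function, for instance up-weighting transitions that ended in unsafe or high-cost states. The weighting function is an assumption made for illustration; the paper's specific scheme may differ.

```python
import random

def biased_sample(buffer, batch_size, weight_fn):
    """Draw a batch from a replay buffer with non-uniform probabilities.
    `weight_fn` maps a transition to a non-negative weight, e.g. larger for
    transitions associated with safety violations or high cost."""
    weights = [weight_fn(transition) for transition in buffer]
    return random.choices(buffer, weights=weights, k=batch_size)
```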
arXiv Detail & Related papers (2021-12-08T11:10:57Z)
- Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction [5.337302350000984]
This paper presents a model-free reinforcement learning algorithm to synthesize a control policy.
The effectiveness of the RL-based control synthesis is demonstrated via simulation and experimental results.
arXiv Detail & Related papers (2020-10-14T03:49:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.