Eventual Discounting Temporal Logic Counterfactual Experience Replay
- URL: http://arxiv.org/abs/2303.02135v1
- Date: Fri, 3 Mar 2023 18:29:47 GMT
- Title: Eventual Discounting Temporal Logic Counterfactual Experience Replay
- Authors: Cameron Voloshin, Abhinav Verma, Yisong Yue
- Abstract summary: The standard RL framework can be too myopic to find maximally LTL-satisfying policies.
First, we develop a new value-function-based proxy using a technique we call eventual discounting.
Second, we develop a new experience replay method for generating off-policy data.
- Score: 42.20459462725206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear temporal logic (LTL) offers a simplified way of specifying tasks for
policy optimization that may otherwise be difficult to describe with scalar
reward functions. However, the standard RL framework can be too myopic to find
maximally LTL satisfying policies. This paper makes two contributions. First,
we develop a new value-function based proxy, using a technique we call eventual
discounting, under which one can find policies that satisfy the LTL
specification with highest achievable probability. Second, we develop a new
experience replay method for generating off-policy data from on-policy rollouts
via counterfactual reasoning on different ways of satisfying the LTL
specification. Our experiments, conducted in both discrete and continuous
state-action spaces, confirm the effectiveness of our counterfactual experience
replay approach.
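Read together, the two contributions admit a compact sketch. The Python below is a minimal illustration under assumed interfaces (an `env` and an `ldba` object exposing `states`, `accepting_states`, `initial_state`, `label`, and `delta`), not the authors' implementation: the product-MDP `step` pays reward 1 on visits to accepting automaton states and discounts only on those visits (eventual discounting), while `counterfactual_relabel` replays a single on-policy rollout from every automaton state to generate off-policy data.

```python
class EventualDiscountingProductMDP:
    """Product of an environment MDP and a limit-deterministic Buchi
    automaton (LDBA), with eventual discounting: reward 1 on every visit
    to an accepting automaton state, and a discount strictly below 1
    applied *only* on those accepting visits; all other steps are
    undiscounted."""

    def __init__(self, env, ldba, gamma_acc=0.99):
        self.env = env
        self.ldba = ldba
        self.gamma_acc = gamma_acc  # discount used on accepting visits only

    def reset(self):
        return (self.env.reset(), self.ldba.initial_state)

    def step(self, state, action):
        s, q = state
        s_next = self.env.step(s, action)
        q_next = self.ldba.delta(q, self.ldba.label(s_next))
        accepting = q_next in self.ldba.accepting_states
        reward = 1.0 if accepting else 0.0
        gamma = self.gamma_acc if accepting else 1.0  # "eventual" discounting
        return (s_next, q_next), reward, gamma


def counterfactual_relabel(trajectory, ldba, gamma_acc=0.99):
    """Counterfactual experience replay (sketch): replay one on-policy
    rollout of environment transitions (s, a, s_next) from *every*
    automaton state, producing off-policy product-MDP transitions that
    correspond to alternative ways of progressing toward satisfaction."""
    replay = []
    for q0 in ldba.states:              # counterfactual automaton start state
        q = q0
        for (s, a, s_next) in trajectory:
            q_next = ldba.delta(q, ldba.label(s_next))
            accepting = q_next in ldba.accepting_states
            reward = 1.0 if accepting else 0.0
            gamma = gamma_acc if accepting else 1.0
            replay.append(((s, q), a, reward, gamma, (s_next, q_next)))
            q = q_next
    return replay
```

Transitions produced this way can be pushed into any off-policy learner's replay buffer alongside the on-policy data.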
Related papers
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [40.93098780862429]
We show that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure.
One first trains a reward model (RM) on some dataset (e.g. human preferences) before using it to provide online feedback as part of a downstream reinforcement learning procedure.
We find the most support for the explanation that, on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM from the preference data and the ability of the downstream RL procedure to then filter its search space to the subset of policies that are optimal for...
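As a concrete, heavily simplified illustration of the two-stage procedure described above, the snippet below shows a standard Bradley-Terry reward-model loss for the first stage; the `rm` model, the tensors, and the use of PyTorch are assumptions for illustration, and the downstream RL stage is only indicated in the docstring.

```python
import torch.nn.functional as F

def reward_model_loss(rm, chosen, rejected):
    """Stage 1 (sketch): fit a reward model on preference pairs with a
    Bradley-Terry objective. In stage 2 the trained `rm` would provide
    online feedback to an RL fine-tuning loop (e.g. PPO). `rm` is assumed
    to map a batch of responses to scalar scores."""
    r_chosen = rm(chosen)        # scores of preferred responses
    r_rejected = rm(rejected)    # scores of dispreferred responses
    # Maximize log sigmoid(r_chosen - r_rejected), i.e. minimize its negation.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```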
arXiv Detail & Related papers (2025-03-03T00:15:19Z)
- DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL [59.01527054553122]
Linear temporal logic (LTL) has recently been adopted as a powerful formalism for specifying complex, temporally extended tasks.
Existing approaches suffer from several shortcomings.
We propose a novel learning approach to address these concerns.
arXiv Detail & Related papers (2024-10-06T21:30:38Z)
- Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning.
We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging.
We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
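One generic way to read "casting the LDBA as a Markov reward process" is sketched below: assume uniform transitions over the automaton's outgoing edges, reward accepting states, and compute automaton-state values by value iteration; the resulting values can serve as a potential or exploration bonus. The accessors (`states`, `accepting_states`, `successors`) are assumptions, and this is not necessarily the paper's exact construction.

```python
def ldba_state_values(ldba, gamma=0.9, iters=200):
    """Treat the LDBA as a Markov reward process (generic sketch): uniform
    transitions over outgoing edges, reward 1 in accepting states, values
    computed by value iteration. Higher values mark automaton states closer
    to acceptance and can be used to direct exploration."""
    V = {q: 0.0 for q in ldba.states}
    for _ in range(iters):
        V_new = {}
        for q in ldba.states:
            r = 1.0 if q in ldba.accepting_states else 0.0
            succ = list(ldba.successors(q))     # assumed accessor
            v_next = sum(V[p] for p in succ) / len(succ) if succ else 0.0
            V_new[q] = r + gamma * v_next
        V = V_new
    return V
```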
arXiv Detail & Related papers (2024-08-18T14:25:44Z)
- Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show that this approach generalizes the direct alignment method IPO (identity preference optimization) and the classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z)
- LTLDoG: Satisfying Temporally-Extended Symbolic Constraints for Safe Diffusion-based Planning [12.839846486863308]
In this work, we focus on generating long-horizon trajectories that adhere to novel static and temporally-extended constraints/instructions at test time.
We propose a data-driven diffusion-based framework, LTLDoG, that modifies the inference steps of the reverse process given an instruction specified using linear temporal logic.
Experiments in robot navigation and manipulation illustrate that the method is able to generate trajectories that satisfy formulae that specify obstacle avoidance and visitation sequences.
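The described mechanism, steering the reverse (denoising) steps with an LTL instruction, resembles classifier-style guidance; the sketch below is a generic version of that idea. The `denoiser.posterior` accessor and the differentiable `ltl_robustness` score are assumptions for illustration, not the paper's actual interfaces or inference procedure.

```python
import torch

def guided_reverse_step(denoiser, x_t, t, ltl_robustness, scale=1.0):
    """One reverse-diffusion step biased by the gradient of a differentiable
    LTL-robustness score of the noisy trajectory x_t: shift the predicted
    mean in the direction that increases robustness, i.e. toward formula
    satisfaction (generic classifier-guidance sketch)."""
    x_t = x_t.detach().requires_grad_(True)
    robustness = ltl_robustness(x_t)            # higher = closer to satisfying
    grad = torch.autograd.grad(robustness.sum(), x_t)[0]
    mean, std = denoiser.posterior(x_t, t)      # assumed: reverse-step mean / std
    guided_mean = mean + scale * (std ** 2) * grad
    return guided_mean + std * torch.randn_like(x_t)
```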
arXiv Detail & Related papers (2024-05-07T11:54:22Z)
- LTL-Constrained Policy Optimization with Cycle Experience Replay [19.43224037705577]
We introduce Cycle Replay (CyclER), a novel reward shaping technique that exploits the underlying structure of a constraint to guide a policy towards satisfaction.
We provide a theoretical guarantee that optimizing CyclER will achieve policies that satisfy the constraint with near-optimal probability.
Our experimental results show that optimizing CyclER in tandem with the existing scalar reward outperforms existing reward-shaping methods at finding performant-satisfying policies.
arXiv Detail & Related papers (2024-04-17T17:24:44Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
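To make "LLM guidance as a regularization factor in value-based RL" concrete, here is a minimal tabular sketch in which the bootstrapped target uses a KL-regularized (soft) value with respect to an LLM-suggested action distribution; the update rule, the `llm_policy` interface, and the hyperparameters are assumptions rather than LINVIT's exact algorithm.

```python
import numpy as np

def regularized_q_update(Q, llm_policy, s, a, r, s_next,
                         alpha=0.1, gamma=0.99, lam=0.1):
    """Tabular Q-learning step whose target uses the KL-regularized soft
    value V(s') = lam * log sum_a' prior(a') * exp(Q(s', a') / lam), where
    `prior` is an action distribution suggested by an LLM. The guidance
    acts as a regularizer pulling the backup toward the LLM's advice."""
    prior = llm_policy(s_next)                  # assumed: array of action probs
    v_soft = lam * np.log(np.dot(prior, np.exp(Q[s_next] / lam)))
    target = r + gamma * v_soft
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```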
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between the learning policy and the dataset.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
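The rebalancing idea is simple enough to sketch directly: resample trajectories with probability increasing in their return, which leaves the support of the data distribution unchanged. The exponential weighting, the "return" field, and the temperature below are assumptions for illustration rather than the paper's exact recipe.

```python
import numpy as np

def return_weighted_resample(dataset, temperature=1.0, seed=0):
    """Return-based data rebalance (sketch): draw trajectories with
    probability increasing in episode return. Sampling with replacement
    from the original data keeps the support unchanged; only the weights
    shift toward high-return trajectories."""
    rng = np.random.default_rng(seed)
    returns = np.array([traj["return"] for traj in dataset], dtype=float)
    weights = np.exp((returns - returns.max()) / temperature)  # numerically stable
    probs = weights / weights.sum()
    idx = rng.choice(len(dataset), size=len(dataset), replace=True, p=probs)
    return [dataset[i] for i in idx]
```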
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Learning Minimally-Violating Continuous Control for Infeasible Linear Temporal Logic Specifications [2.496282558123411]
This paper explores continuous-time control for target-driven navigation to satisfy complex high-level tasks expressed as linear temporal logic (LTL).
We propose a model-free synthesis framework using deep reinforcement learning (DRL) where the underlying dynamic system is unknown (an opaque box).
arXiv Detail & Related papers (2022-10-03T18:32:20Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
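The two-policy mechanism lends itself to a short sketch: a guide policy (from offline data, demonstrations, or an existing controller) acts for the first part of each episode and the learning policy takes over afterward, with the hand-off point annealed toward zero as performance improves. Interface names below are assumptions for illustration.

```python
def jsrl_rollout(env, guide_policy, exploration_policy, jump_step, max_steps=1000):
    """Collect one episode in which the guide policy controls the first
    `jump_step` steps and the exploration (learning) policy controls the
    rest. A curriculum outside this function would shrink `jump_step`
    as the learner improves."""
    s = env.reset()
    trajectory = []
    for t in range(max_steps):
        policy = guide_policy if t < jump_step else exploration_policy
        a = policy(s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r, s_next, done))
        s = s_next
        if done:
            break
    return trajectory
```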
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Replay For Safety [51.11953997546418]
In experience replay, past transitions are stored in a memory buffer and re-used during learning.
We show that using an appropriate biased sampling scheme can allow us to achieve a safe policy.
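As a generic illustration of biased replay sampling for safety (the summary does not spell out the specific scheme), the snippet below oversamples transitions flagged as unsafe so the learner revisits them more often; the flag and the weight are assumptions.

```python
import random

def sample_safety_biased(buffer, batch_size, unsafe_weight=5.0):
    """Draw a replay batch in which transitions flagged as unsafe (e.g.
    cost- or constraint-violating) are weighted more heavily than the
    rest. A generic sketch, not the paper's specific sampling scheme."""
    weights = [unsafe_weight if tr["unsafe"] else 1.0 for tr in buffer]
    return random.choices(buffer, weights=weights, k=batch_size)
```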
arXiv Detail & Related papers (2021-12-08T11:10:57Z)
- Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction [5.337302350000984]
This paper presents a model-free reinforcement learning algorithm to synthesize a control policy that maximizes the probability of satisfying a temporal logic specification.
The effectiveness of the RL-based control synthesis is demonstrated via simulation and experimental results.
arXiv Detail & Related papers (2020-10-14T03:49:16Z)