SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments
- URL: http://arxiv.org/abs/2407.18913v2
- Date: Fri, 11 Oct 2024 15:35:01 GMT
- Title: SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments
- Authors: Shu Ishida, João F. Henriques,
- Abstract summary: This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options.
Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem.
- Score: 18.081732498034047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended action, which can be realized as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and associated sub-policies without explicit supervision is a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm (for Hidden Markov Models) to optimize the expected returns for an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also unsuited for learning causal policies without the knowledge of future trajectories, since option assignments are optimized for offline sequences where the entire episode is available. As an alternative approach, SOAP evaluates the policy gradient for an optimal option assignment. It extends the concept of the generalized advantage estimation (GAE) to propagate option advantages through time, which is an analytical equivalent to performing temporal back-propagation of option policy gradients. This option policy is only conditional on the history of the agent, not future actions. Evaluated against competing baselines, SOAP exhibited the most robust performance, correctly discovering options for POMDP corridor environments, as well as on standard benchmarks including Atari and MuJoCo, outperforming PPOEM, as well as LSTM and Option-Critic baselines. The open-sourced code is available at https://github.com/shuishida/SoapRL.
Related papers
- SePPO: Semi-Policy Preference Optimization for Diffusion Alignment [67.8738082040299]
We propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data.
We validate SePPO across both text-to-image and text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $mathcalO(sqrtT log(T)$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - A dynamical clipping approach with task feedback for Proximal Policy Optimization [29.855219523565786]
There is no theoretical proof that the optimal PPO clipping bound remains consistent throughout the entire training process.
Past studies have aimed to dynamically adjust PPO clipping bound to enhance PPO's performance.
We propose Preference based Proximal Policy Optimization (Pb-PPO) to better reflect the preference (maximizing Return) of reinforcement learning tasks.
arXiv Detail & Related papers (2023-12-12T06:35:56Z) - Multi-Task Option Learning and Discovery for Stochastic Path Planning [27.384742641275228]
This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon path planning problems.
Our approach computes useful options with policies as well as high-level paths that compose the discovered options.
We show that this approach yields strong guarantees of executability and solvability.
arXiv Detail & Related papers (2022-09-30T19:57:52Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
emphrelative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses emphgenerative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Revisiting Design Choices in Proximal Policy Optimization [21.721075405670916]
Proximal Policy Optimization (PPO) is a popular deep policy algorithm gradient.
These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks.
We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO.
arXiv Detail & Related papers (2020-09-23T02:00:34Z) - On the Role of Weight Sharing During Deep Option Learning [21.216780543401235]
The options framework is a popular approach for building temporally extended actions in reinforcement learning.
Past work makes the key assumption that each of the components of option-critic has independent parameters.
We consider more general extensions of option-critic and hierarchical option-critic training that optimize for the full architecture with each update.
arXiv Detail & Related papers (2019-12-31T16:49:13Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO)
OPPO achieves $tildeO(sqrtd2 H3 T )$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.