Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning
- URL: http://arxiv.org/abs/2202.03957v1
- Date: Tue, 8 Feb 2022 16:09:02 GMT
- Title: Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning
- Authors: Stephen James, Pieter Abbeel
- Abstract summary: We propose a new policy parameterization for representing 3D rotations during reinforcement learning.
Our proposed Bingham Policy Parameterization (BPP) models the Bingham distribution and allows for better rotation prediction.
We evaluate BPP on the rotation Wahba problem task, as well as a set of vision-based next-best pose robot manipulation tasks from RLBench.
- Score: 95.00518278458908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new policy parameterization for representing 3D rotations during
reinforcement learning. Today in the continuous control reinforcement learning
literature, many stochastic policy parameterizations are Gaussian. We argue
that universally applying a Gaussian policy parameterization is not always
desirable for all environments. One such case in particular where this is true
is tasks that involve predicting a 3D rotation output, either in isolation, or
coupled with translation as part of a full 6D pose output. Our proposed Bingham
Policy Parameterization (BPP) models the Bingham distribution and allows for
better rotation (quaternion) prediction over a Gaussian policy parameterization
in a range of reinforcement learning tasks. We evaluate BPP on the rotation
Wahba problem task, as well as a set of vision-based next-best pose robot
manipulation tasks from RLBench. We hope that this paper encourages more
research into developing other policy parameterizations that are more suited for
particular environments, rather than always assuming Gaussian.
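To make the above concrete, here is a minimal sketch (not the authors' BPP implementation) of the Bingham distribution over unit quaternions that such a policy head would parameterize: the density is proportional to exp(q^T M Z M^T q) for an orthogonal 4x4 matrix M and nonpositive concentrations Z, and because that exponent is at most zero, samples can be drawn by rejection against uniform random quaternions. The particular M and Z below are hand-picked purely for illustration; in BPP they would be predicted by the policy network.

```python
import numpy as np

def bingham_log_unnormalized(q, M, Z):
    """Unnormalized log-density of a Bingham distribution at unit quaternion q."""
    A = M @ Z @ M.T                               # 4x4 negative semidefinite matrix
    return float(q @ A @ q)

def sample_bingham(M, Z, rng, max_tries=100000):
    """Rejection sampling: accept a uniform quaternion with probability exp(q^T A q) <= 1."""
    A = M @ Z @ M.T
    for _ in range(max_tries):
        q = rng.normal(size=4)
        q /= np.linalg.norm(q)                    # uniform proposal on the unit 3-sphere
        if np.log(rng.uniform()) < q @ A @ q:
            return q
    raise RuntimeError("rejection sampling did not converge")

# Hypothetical, hand-picked parameters; a BPP policy head would output these instead.
rng = np.random.default_rng(0)
M, _ = np.linalg.qr(rng.normal(size=(4, 4)))      # any orthogonal 4x4 matrix
Z = np.diag([-10.0, -10.0, -10.0, 0.0])           # nonpositive concentrations, largest fixed at 0
q = sample_bingham(M, Z, rng)
print("sampled quaternion:", q)
print("unnormalized log-density:", bingham_log_unnormalized(q, M, Z))
```

With the largest concentration fixed at zero, the mode of the distribution is the column of M paired with that zero entry, up to sign, since the quaternions q and -q encode the same rotation.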
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
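For readers unfamiliar with value-distribution learning, the sketch below shows the standard quantile-regression ("pinball") loss that methods in this family typically minimize; it is not EQR itself, and the quantile levels and targets are illustrative.

```python
import numpy as np

def pinball_loss(predicted_quantiles, target_returns, taus):
    """Mean quantile-regression loss.

    predicted_quantiles: shape (n_quantiles,), current estimates of the return quantiles
    target_returns:      shape (n_targets,), sampled target returns
    taus:                shape (n_quantiles,), quantile levels in (0, 1)
    """
    u = target_returns[:, None] - predicted_quantiles[None, :]   # pairwise errors
    loss = np.where(u >= 0, taus * u, (taus - 1.0) * u)          # tau*u if u>=0 else (tau-1)*u
    return loss.mean()

taus = (np.arange(5) + 0.5) / 5.0                                # 5 evenly spaced quantile levels
preds = np.zeros(5)                                              # initial quantile estimates
targets = np.random.default_rng(0).normal(1.0, 2.0, size=256)    # stand-in for sampled returns
print(pinball_loss(preds, targets, taus))
```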
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Subequivariant Graph Reinforcement Learning in 3D Environments [34.875774768800966]
We propose a novel setup for morphology-agnostic RL, dubbed Subequivariant Graph RL in 3D environments.
Specifically, we first introduce a new set of more practical yet challenging benchmarks in 3D space.
To optimize the policy over the enlarged state-action space, we propose to inject geometric symmetry.
arXiv Detail & Related papers (2023-05-30T11:34:57Z)
- Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization [10.789649934346004]
We propose a sample-efficient method named cyclic policy distillation (CPD).
CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one.
All of the learned local policies are distilled into a global policy for sim-to-real transfers.
arXiv Detail & Related papers (2022-07-29T09:22:53Z)
- Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, Decentralized Optimistic Hyperpolicy Mirror Descent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is establishing that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- Proximal Policy Optimization with Continuous Bounded Action Space via the Beta Distribution [0.0]
In this work, we investigate how this Beta policy performs when it is trained by the Proximal Policy Optimization algorithm on two continuous control tasks from OpenAI gym.
For both tasks, the Beta policy is superior to the Gaussian policy in terms of agent's final expected reward, also showing more stability and faster convergence of the training process.
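As a reminder of how such a Beta policy is typically constructed (a hedged sketch, not that paper's code), the snippet below maps raw network outputs to shape parameters alpha, beta > 1, samples an action on [0, 1], and rescales it to the bounded action range, adjusting the log-density for the affine rescaling.

```python
import numpy as np
from scipy.special import betaln

def softplus(x):
    return np.log1p(np.exp(x))

def beta_policy_sample(raw_alpha, raw_beta, low, high, rng):
    alpha = 1.0 + softplus(raw_alpha)      # alpha, beta > 1 keeps the density unimodal
    beta = 1.0 + softplus(raw_beta)
    u = rng.beta(alpha, beta)              # sample on [0, 1]
    action = low + (high - low) * u        # rescale to the bounded action space
    # log-density of the Beta sample, with the change of variables for the rescaling
    log_pdf_u = (alpha - 1) * np.log(u) + (beta - 1) * np.log(1 - u) - betaln(alpha, beta)
    log_prob = log_pdf_u - np.log(high - low)
    return action, log_prob

rng = np.random.default_rng(0)
action, log_prob = beta_policy_sample(0.5, -0.2, low=-2.0, high=2.0, rng=rng)
print(action, log_prob)
```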
arXiv Detail & Related papers (2021-11-03T13:13:00Z)
- Implicit Distributional Reinforcement Learning [61.166030238490634]
The proposed implicit distributional actor-critic (IDAC) is built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
- Gaussian Process Policy Optimization [0.0]
We propose a novel actor-critic, model-free reinforcement learning algorithm.
It employs a Bayesian method of parameter space exploration to solve environments.
It is shown to be comparable to and at times empirically outperform current algorithms on environments that simulate robotic locomotion.
arXiv Detail & Related papers (2020-03-02T18:06:27Z)
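A hedged sketch of what Bayesian parameter-space exploration of this kind can look like (not that paper's algorithm): a Gaussian process is fit to observed (policy parameters, episodic return) pairs, and the next parameters to evaluate maximize an upper-confidence-bound acquisition. The return function below is a synthetic stand-in for a real environment rollout.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def episode_return(theta):
    # hypothetical stand-in for "run the policy with parameters theta in the environment"
    return -np.sum((theta - 0.3) ** 2)

rng = np.random.default_rng(0)
thetas = list(rng.uniform(-1, 1, size=(5, 2)))            # initial random policy parameters
returns = [episode_return(t) for t in thetas]

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(thetas), np.array(returns))
    candidates = rng.uniform(-1, 1, size=(256, 2))
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + 2.0 * std                                # optimistic acquisition
    best = candidates[np.argmax(ucb)]
    thetas.append(best)
    returns.append(episode_return(best))

print("best parameters:", thetas[int(np.argmax(returns))], "return:", max(returns))
```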