End-to-end RL Improves Dexterous Grasping Policies
- URL: http://arxiv.org/abs/2509.16434v1
- Date: Fri, 19 Sep 2025 21:21:29 GMT
- Title: End-to-end RL Improves Dexterous Grasping Policies
- Authors: Ritvik Singh, Karl Van Wyk, Pieter Abbeel, Jitendra Malik, Nathan Ratliff, Ankur Handa
- Abstract summary: This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. We train and distill both depth and state-based policies into stereo RGB networks and show that depth distillation leads to better results, both in simulation and reality.
- Score: 64.8476328230578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. Unlike state-based RL, vision-based RL is far less memory-efficient, resulting in relatively low batch sizes that are not amenable to algorithms like PPO. Nevertheless, it remains an attractive method because, unlike the more common approach of distilling state-based policies into vision networks, end-to-end RL can allow for emergent active vision behaviors. We identify that a key bottleneck in training these policies is the way most existing simulators scale to multiple GPUs using traditional data parallelism techniques. We propose a new method in which we disaggregate the simulator and RL (both training and experience buffers) onto separate GPUs. On a node with four GPUs, the simulator runs on three of them and PPO runs on the fourth. We show that with the same number of GPUs, we can double the number of environments compared to the previous baseline of standard data parallelism. This allows us to train vision-based policies end-to-end with depth, which previously performed far worse under the baseline. We train and distill both depth and state-based policies into stereo RGB networks and show that depth distillation leads to better results, both in simulation and reality. This improvement is likely due to the observability gap between state and vision policies, which does not exist when distilling depth policies into stereo RGB. We further show that the increased batch size brought about by disaggregated simulation also improves real-world performance. When deploying in the real world, we improve upon the previous state-of-the-art vision-based results using our end-to-end policies.
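The disaggregated setup the abstract describes (simulator on three GPUs, PPO learner on the fourth, joined by an experience buffer) is, in spirit, a producer/consumer pipeline. The sketch below is purely illustrative and not the authors' implementation: threads and a shared queue stand in for GPU processes and device buffers, and all names (`sim_worker`, `run_learner`, the fake rollout payloads) are assumptions.

```python
import queue
import threading

NUM_SIM_WORKERS = 3      # stand-ins for the three simulator GPUs
STEPS_PER_WORKER = 4     # rollout chunks each worker produces

def sim_worker(worker_id: int, buffer: queue.Queue) -> None:
    """Produce fake rollout chunks into the shared experience buffer."""
    for step in range(STEPS_PER_WORKER):
        rollout = {"worker": worker_id, "step": step, "reward": float(step)}
        buffer.put(rollout)

def run_learner(buffer: queue.Queue, total_chunks: int) -> float:
    """Consume rollouts (the 'PPO GPU') and accumulate a fake objective."""
    total_reward = 0.0
    for _ in range(total_chunks):
        rollout = buffer.get()   # blocks until a simulator delivers data
        total_reward += rollout["reward"]
    return total_reward

def main() -> float:
    buffer: queue.Queue = queue.Queue()
    workers = [
        threading.Thread(target=sim_worker, args=(i, buffer))
        for i in range(NUM_SIM_WORKERS)
    ]
    for w in workers:
        w.start()
    result = run_learner(buffer, NUM_SIM_WORKERS * STEPS_PER_WORKER)
    for w in workers:
        w.join()
    return result

if __name__ == "__main__":
    print(main())
```

The design point the paper makes is that producers and the consumer need not share a device or a data-parallel replica: simulation throughput and learner batch size can be scaled independently.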
Related papers
- PICT -- A Differentiable, GPU-Accelerated Multi-Block PISO Solver for Simulation-Coupled Learning Tasks in Fluid Dynamics [59.38498811984876]
We present our fluid simulator PICT, a differentiable pressure-implicit solver coded in PyTorch with Graphics-processing-unit (GPU) support. We first verify the accuracy of both the forward simulation and our derived gradients in various established benchmarks. We show that the gradients provided by our solver can be used to learn complicated turbulence models in 2D and 3D.
arXiv Detail & Related papers (2025-05-22T17:55:10Z) - Accelerating Visual-Policy Learning through Parallel Differentiable Simulation [3.70729078195191]
We propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouples the rendering process from the computation graph, enabling seamless integration with existing differentiable simulation ecosystems. Notably, our method achieves a $4\times$ improvement in final return, and successfully learns a humanoid running policy within 4 hours on a single GPU.
arXiv Detail & Related papers (2025-05-15T18:38:36Z) - Dream to Drive: Model-Based Vehicle Control Using Analytic World Models [67.20720048255362]
We present three new task setups that allow us to learn next state predictors, optimal planners, and optimal inverse states. Unlike analytic policy gradients (APG), which require the gradient of the next simulator state with respect to the current actions, our proposed setups rely on the gradient of the next state with respect to the current state.
arXiv Detail & Related papers (2025-02-14T08:46:49Z) - SAPG: Split and Aggregate Policy Gradients [37.433915947580076]
We propose a new on-policy RL algorithm that can effectively leverage large-scale environments by splitting them into chunks and fusing them back together via importance sampling.
Our algorithm, termed SAPG, shows significantly higher performance across a variety of challenging environments where vanilla PPO and other strong baselines fail to achieve high performance.
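The core mechanism the SAPG abstract names, fusing experience from split environment chunks back into a single update via importance sampling, can be illustrated with a tiny reweighting sketch. This is not the SAPG authors' code; the transition fields and the clipped-ratio choice are assumptions made for the example.

```python
import math

def importance_weight(logp_leader: float, logp_behavior: float,
                      clip: float = 10.0) -> float:
    """Ratio pi_leader(a|s) / pi_behavior(a|s), clipped for stability."""
    return min(math.exp(logp_leader - logp_behavior), clip)

def fused_policy_gradient(transitions) -> float:
    """Average of advantage * grad-logp, importance-weighted per transition
    so that data collected by other chunks' behavior policies can be reused."""
    total = 0.0
    for t in transitions:
        w = importance_weight(t["logp_leader"], t["logp_behavior"])
        total += w * t["advantage"] * t["grad_logp"]
    return total / len(transitions)

# Two transitions: one on-policy (ratio 1), one from another chunk
# whose behavior policy assigned the action a lower log-probability.
batch = [
    {"logp_leader": -1.0, "logp_behavior": -1.0, "advantage": 2.0, "grad_logp": 0.5},
    {"logp_leader": -1.0, "logp_behavior": -2.0, "advantage": 1.0, "grad_logp": 0.5},
]
print(fused_policy_gradient(batch))
```

The reweighting corrects for the distribution mismatch between the chunk that collected a transition and the policy being updated, which is what makes the split-then-fuse scheme remain (approximately) on-policy.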
arXiv Detail & Related papers (2024-07-29T17:59:50Z) - Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
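The paper above learns an adaptive discretization; as a rough illustration of why adaptivity helps, the sketch below places bin edges at empirical quantiles of the action data, so dense action regions get finer resolution than uniform binning would. All names here are hypothetical and this is only an approximation of the idea, not the paper's learned scheme.

```python
def quantile_bins(actions, num_bins):
    """Bin edges placed at empirical quantiles of the action data."""
    s = sorted(actions)
    return [s[int(i * (len(s) - 1) / num_bins)] for i in range(num_bins + 1)]

def quantize(action, edges):
    """Map a continuous action to the index of its quantile bin."""
    for i in range(len(edges) - 1):
        if action <= edges[i + 1]:
            return i
    return len(edges) - 2  # clamp actions above the last edge

actions = [0.01, 0.02, 0.03, 0.04, 0.9]  # density concentrated near zero
edges = quantile_bins(actions, 2)
print(edges)            # edges are tight where the data are dense
print(quantize(0.025, edges))
```

With uniform bins over [0.01, 0.9], all four low-magnitude actions would collapse into one bin; quantile edges keep them distinguishable, which is the intuition behind adapting the discretization to the dataset.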
arXiv Detail & Related papers (2023-10-18T06:07:10Z) - Efficient Deep Visual and Inertial Odometry with Adaptive Visual Modality Selection [12.754974372231647]
We propose an adaptive deep-learning based VIO method that reduces computational redundancy by opportunistically disabling the visual modality.
A Gumbel-Softmax trick is adopted to train the policy network to make the decision process differentiable for end-to-end system training.
Experimental results show that our method achieves similar or even better performance than the full-modality baseline.
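The Gumbel-Softmax trick the abstract mentions makes a discrete decision (here, enable or skip the visual modality) differentiable by replacing hard sampling with a temperature-controlled softmax over Gumbel-perturbed logits. The sketch below is a minimal stand-alone version, not the paper's policy network; all names are illustrative.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Differentiable relaxation of sampling from Categorical(logits):
    add Gumbel(0, 1) noise to each logit, then take a softmax whose
    temperature controls how close the output is to a hard one-hot."""
    noisy = [
        l - math.log(-math.log(rng.random()))  # Gumbel(0, 1) perturbation
        for l in logits
    ]
    exps = [math.exp(n / temperature) for n in noisy]
    z = sum(exps)
    return [e / z for e in exps]  # soft one-hot; argmax matches the sample

random.seed(0)
# logits for [use camera, skip camera]; low temperature -> near one-hot
probs = gumbel_softmax([2.0, 0.5], temperature=0.5)
print(probs)
```

At training time the soft vector flows gradients into the logits; at deployment the argmax recovers the hard on/off decision, which is how the selection policy can be trained end-to-end.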
arXiv Detail & Related papers (2022-05-12T16:17:49Z) - Distilled Domain Randomization [23.178141671320436]
We propose to combine reinforcement learning from randomized physics simulations with policy distillation.
Our algorithm, called Distilled Domain Randomization (DiDoR), distills so-called teacher policies, each of which is an expert on an individual domain.
This way, DiDoR learns controllers which transfer directly from simulation to reality, without requiring data from the target domain.
arXiv Detail & Related papers (2021-12-06T16:35:08Z) - PlayVirtual: Augmenting Cycle-Consistent Virtual Trajectories for Reinforcement Learning [84.30765628008207]
We propose a novel method, dubbed PlayVirtual, which augments cycle-consistent virtual trajectories to enhance the data efficiency for RL feature representation learning.
Our method outperforms the current state-of-the-art methods by a large margin on both benchmarks.
arXiv Detail & Related papers (2021-06-08T07:37:37Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt the graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z) - Discrete-to-Deep Supervised Policy Learning [2.212418070140923]
This paper proposes Discrete-to-Deep Supervised Policy Learning (D2D-SPL) for training neural networks in reinforcement learning.
D2D-SPL uses a single agent, needs no experience replay and learns much faster than state-of-the-art methods.
arXiv Detail & Related papers (2020-05-05T10:49:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.