Related papers: VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning

VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning

URL: http://arxiv.org/abs/2510.14930v2
Date: Sat, 18 Oct 2025 14:47:51 GMT
Title: VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning
Authors: Binghao Huang, Jie Xu, Iretiayo Akinola, Wei Yang, Balakumar Sundaralingam, Rowland O'Flaherty, Dieter Fox, Xiaolong Wang, Arsalan Mousavian, Yu-Wei Chao, Yunzhu Li,
Abstract summary: Humans excel at bimanual assembly tasks by adapting to rich tactile feedback.<n>We present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning.
Score: 39.49846628626501
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://binghao-huang.github.io/vt_refine/.

Related papers

D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [66.22412592525369]
We introduce a real-to-sim-to-real engine that leverages the Gaussian Splat representations to build a differentiable engine.<n>We show that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values.<n>Those optimized mass values facilitate force-aware policy learning, achieving superior and high performance in object grasping.
arXiv Detail & Related papers (2026-03-01T15:32:04Z)
Closing the Reality Gap: Zero-Shot Sim-to-Real Deployment for Dexterous Force-Based Grasping and Manipulation [12.509181374985936]
Human-like dexterous hands with multiple fingers offer human-level manipulation capabilities.<n>But training control policies that can directly deploy on real hardware remains difficult due to contact-rich physics.<n>We present a practical framework that utilizes dense tactile feedback combined with joint torque sensing to regulate physical interactions.
arXiv Detail & Related papers (2026-01-06T07:26:39Z)
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning [72.43357471969564]
Isaac Lab combines high-fidelity GPU parallel physics, rendering, and a modular, composable architecture for designing environments and training robot policies.<n>We highlight its application to a diverse set of challenges, including whole-body control, cross-embodiment mobility, contact-rich and dexterous manipulation, and the integration of human demonstrations for skill acquisition.<n>We believe Isaac Lab's combination of advanced simulation capabilities, rich sensing, and data-center scale execution will help unlock the next generation of breakthroughs in robotics research.
arXiv Detail & Related papers (2025-11-06T21:43:02Z)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper [7.618517580705364]
We present a portable, lightweight gripper with integrated tactile sensors.<n>We propose a cross-modal representation learning framework that integrates visual and tactile signals.<n>We validate our approach on fine-grained tasks such as test tube insertion and pipette-based fluid transfer.
arXiv Detail & Related papers (2025-07-20T17:53:59Z)
The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio [138.07247714782412]
MultiGen is a framework that integrates large-scale generative models into traditional physics simulators.<n>We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids.
arXiv Detail & Related papers (2025-07-03T17:59:58Z)
Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.<n>To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.<n> Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
DiffSim2Real: Deploying Quadrupedal Locomotion Policies Purely Trained in Differentiable Simulation [35.76143996968696]
We show that locomotion policies trained with analytic gradients from a differentiable simulator can be successfully transferred to the real world. A key factor in our success is a smooth contact model that combines informative gradients with physical accuracy. This is the first time a real quadpedal robot is able to locomote after training exclusively in a differentiable simulation.
arXiv Detail & Related papers (2024-11-04T15:43:57Z)
Visual-Tactile Sensing for In-Hand Object Reconstruction [38.42487660352112]
We propose a visual-tactile in-hand object reconstruction framework textbfVTacO, and extend it to textbfVTacOH for hand-object reconstruction. A simulation environment, VT-Sim, supports generating hand-object interaction for both rigid and deformable objects.
arXiv Detail & Related papers (2023-03-25T15:16:31Z)
Residual Physics Learning and System Identification for Sim-to-real Transfer of Policies on Buoyancy Assisted Legged Robots [14.760426243769308]
In this work, we demonstrate robust sim-to-real transfer of control policies on the BALLU robots via system identification. Rather than relying on standard supervised learning formulations, we utilize deep reinforcement learning to train an external force policy. We analyze the improved simulation fidelity by comparing the simulation trajectories against the real-world ones.
arXiv Detail & Related papers (2023-03-16T18:49:05Z)
DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality [64.51295032956118]
We train a policy that can perform robust dexterous manipulation on an anthropomorphic robot hand. Our work reaffirms the possibilities of sim-to-real transfer for dexterous manipulation in diverse kinds of hardware and simulator setups.
arXiv Detail & Related papers (2022-10-25T01:51:36Z)
TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors [74.67698916175614]
We propose TrafficSim, a multi-agent behavior model for realistic traffic simulation. In particular, we leverage an implicit latent variable model to parameterize a joint actor policy. We show TrafficSim generates significantly more realistic and diverse traffic scenarios as compared to a diverse set of baselines.
arXiv Detail & Related papers (2021-01-17T00:29:30Z)
Learning the sense of touch in simulation: a sim-to-real strategy for vision-based tactile sensing [1.9981375888949469]
This paper focuses on a vision-based tactile sensor, which aims to reconstruct the distribution of the three-dimensional contact forces applied on its soft surface. A strategy is proposed to train a tailored deep neural network entirely from the simulation data. The resulting learning architecture is directly transferable across multiple tactile sensors without further training and yields accurate predictions on real data.
arXiv Detail & Related papers (2020-03-05T14:17:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.