End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection
- URL: http://arxiv.org/abs/2511.00139v1
- Date: Fri, 31 Oct 2025 16:12:02 GMT
- Title: End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection
- Authors: Yu Cui, Yujian Zhang, Lina Tao, Yang Li, Xinyu Yi, Zhibin Li
- Abstract summary: We propose a framework that divides control between macro and micro motions. A human operator guides the robot's arm pose through intuitive VR teleoperation. An autonomous DexGrasp-VLA policy handles fine-grained hand control using real-time tactile and visual feedback.
- Score: 10.217810309422232
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Achieving human-like dexterous manipulation remains a major challenge for general-purpose robots. While Vision-Language-Action (VLA) models show potential in learning skills from demonstrations, their scalability is limited by scarce high-quality training data. Existing data collection methods face inherent constraints: manual teleoperation overloads human operators, while automated planning often produces unnatural motions. We propose a Shared Autonomy framework that divides control between macro and micro motions. A human operator guides the robot's arm pose through intuitive VR teleoperation, while an autonomous DexGrasp-VLA policy handles fine-grained hand control using real-time tactile and visual feedback. This division significantly reduces cognitive load and enables efficient collection of high-quality coordinated arm-hand demonstrations. Using this data, we train an end-to-end VLA policy enhanced with our novel Arm-Hand Feature Enhancement module, which captures both distinct and shared representations of macro and micro movements for more natural coordination. Our Corrective Teleoperation system enables continuous policy improvement through human-in-the-loop failure recovery. Experiments demonstrate that our framework generates high-quality data with minimal manpower and achieves a 90% success rate across diverse objects, including unseen instances. Comprehensive evaluations validate the system's effectiveness in developing dexterous manipulation capabilities.
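The macro/micro split described in the abstract can be pictured as a single control loop in which the operator supplies the arm target and the learned policy supplies the hand command. The sketch below is an assumption-laden illustration: the class names (VRTeleop-style reader, DexGrasp-VLA-style policy, robot interface) and method signatures are hypothetical stand-ins, not the authors' released code.

```python
# Minimal sketch of the macro/micro shared-autonomy split described in the abstract.
# All interfaces (vr, hand_policy, robot) are hypothetical stand-ins, not the paper's API.
import numpy as np

class SharedAutonomyCollector:
    def __init__(self, vr, hand_policy, robot):
        self.vr = vr                    # streams the operator's wrist pose from the VR headset/controllers
        self.hand_policy = hand_policy  # autonomous DexGrasp-VLA-style grasping policy
        self.robot = robot

    def step(self):
        # Macro motion: the human operator sets the arm/wrist target via VR teleoperation.
        arm_pose = self.vr.read_wrist_pose()          # e.g. (x, y, z, qx, qy, qz, qw)

        # Micro motion: the autonomous policy selects finger joint targets from
        # real-time visual and tactile feedback.
        obs = {
            "rgb": self.robot.camera_rgb(),
            "tactile": self.robot.tactile_array(),
        }
        hand_joints = self.hand_policy.predict(obs)   # e.g. shape (n_hand_dofs,)

        # Execute both channels and log the synchronized pair as one demonstration frame.
        self.robot.command(arm_pose=arm_pose, hand_joints=hand_joints)
        return {"arm_pose": np.asarray(arm_pose),
                "hand_joints": np.asarray(hand_joints),
                **obs}
```

Because the operator only steers the coarse arm pose while the policy closes the grasp, each logged frame is a coordinated arm-hand demonstration collected with far less operator effort than full manual teleoperation.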
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves strong generalization capability, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z)
- METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model [36.82365894983052]
A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills. We propose METIS, a vision-language-action model for dexterous manipulation pretrained on egocentric datasets. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate across six real-world tasks.
arXiv Detail & Related papers (2025-11-21T16:32:36Z)
- ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations [32.570602111692914]
We present ActiveUMI, a data collection framework that transfers in-the-wild human demonstrations to robots capable of complex bimanual manipulation. ActiveUMI couples a portable VR teleoperation kit with sensorized controllers that mirror the robot's end-effectors. By recording an operator's deliberate head movements via a head-mounted display, our system learns the crucial link between visual attention and manipulation.
arXiv Detail & Related papers (2025-10-02T02:44:21Z)
- MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning [3.079859911926098]
We present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices.
arXiv Detail & Related papers (2025-09-23T07:53:05Z)
- The Role of Embodiment in Intuitive Whole-Body Teleoperation for Mobile Manipulation [20.65893345441958]
A strong sense of embodiment combined with minimal physical and cognitive demands helps maintain data quality over extended periods. We evaluate two visual feedback mechanisms: immersive virtual reality and conventional screen-based visualization of the robot's field of view. Our results show that the use of VR as a feedback modality increases task completion time, cognitive workload, and perceived effort of the teleoperator.
arXiv Detail & Related papers (2025-09-03T11:25:36Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
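As a purely illustrative reading of "force as a first-class modality", the snippet below sketches a tiny force-aware mixture-of-experts layer in PyTorch; the fusion scheme, layer sizes, and routing are assumptions, not ForceVLA's actual architecture.

```python
# Illustrative sketch: a small mixture-of-experts layer that consumes a 6-axis
# force/torque reading alongside visual features. Sizes and fusion are assumptions.
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    def __init__(self, vis_dim=256, force_dim=6, hidden=128, n_experts=4, out_dim=64):
        super().__init__()
        self.force_proj = nn.Linear(force_dim, vis_dim)   # lift the F/T reading into feature space
        self.gate = nn.Linear(2 * vis_dim, n_experts)     # router conditioned on both modalities
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * vis_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for _ in range(n_experts)
        )

    def forward(self, vis_feat, force):
        fused = torch.cat([vis_feat, self.force_proj(force)], dim=-1)
        weights = torch.softmax(self.gate(fused), dim=-1)                    # (B, n_experts)
        expert_out = torch.stack([e(fused) for e in self.experts], dim=1)    # (B, n_experts, out_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)               # weighted expert mixture
```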
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
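To make "reasoning before acting" concrete, here is a hedged sketch of what an ECoT-style training target might look like as text supervision; the field names and ordering are illustrative assumptions, not the paper's exact schema.

```python
# Sketch of an embodied chain-of-thought training target: intermediate reasoning
# (plan, sub-task, motion, grounded objects) is emitted before the action tokens.
# Field names below are illustrative assumptions.
def build_ecot_target(instruction, plan, subtask, motion, objects, action):
    reasoning = (
        f"TASK: {instruction}\n"
        f"PLAN: {plan}\n"
        f"SUBTASK: {subtask}\n"
        f"MOVE: {motion}\n"
        f"VISIBLE OBJECTS: {objects}\n"
    )
    return reasoning + f"ACTION: {action}"

print(build_ecot_target(
    instruction="put the cup on the plate",
    plan="locate cup; grasp cup; move above plate; release",
    subtask="grasp cup",
    motion="move gripper down and close",
    objects="cup [112, 84, 180, 150]",
    action="[0.02, -0.01, -0.05, 0.0, 0.0, 0.0, 1.0]",
))
```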
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
- Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition [48.65867987106428]
We introduce a novel system for joint learning between human operators and robots.
It enables human operators to share control of a robot end-effector with a learned assistive agent.
It reduces the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks.
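A minimal sketch of how such shared control could blend operator and agent commands is given below; the convex-combination rule and the confidence-based weight are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of blending a human operator's command with an assistive agent's
# command for a robot end-effector. The blending rule is an assumption.
import numpy as np

def blend_commands(human_twist, agent_twist, agent_confidence, max_assist=0.7):
    """Convex combination of human and assistive-agent end-effector twists.

    human_twist, agent_twist: 6-vectors (vx, vy, vz, wx, wy, wz)
    agent_confidence: scalar in [0, 1] reported by the learned agent
    max_assist: cap on how much authority the agent may take
    """
    alpha = min(max_assist, float(agent_confidence))   # agent's share of control
    return (1.0 - alpha) * np.asarray(human_twist) + alpha * np.asarray(agent_twist)
```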
arXiv Detail & Related papers (2024-06-29T03:37:29Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
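As a hedged illustration of turning a single behavior-cloning step into a conversation-style training sample, consider the sketch below; the template and field names are assumptions, not LLaRA's actual pipeline.

```python
# Illustrative conversion of one behavior-cloning step into a visuo-textual
# conversation sample. Template and keys are assumptions for illustration.
def bc_step_to_conversation(instruction, image_path, action):
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nInstruction: {instruction}\nWhat action should the robot take?"},
            {"from": "assistant",
             "value": f"Translate end-effector by {action[:3]}, rotate by {action[3:6]}, "
                      f"gripper {'close' if action[6] > 0.5 else 'open'}."},
        ],
    }

sample = bc_step_to_conversation("pick up the red block", "frame_0001.png",
                                 [0.01, 0.0, -0.03, 0.0, 0.0, 0.1, 1.0])
```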
arXiv Detail & Related papers (2024-06-28T17:59:12Z)