MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning
- URL: http://arxiv.org/abs/2509.18757v1
- Date: Tue, 23 Sep 2025 07:53:05 GMT
- Title: MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning
- Authors: Omar Rayyan, John Abanes, Mahmoud Hafez, Anthony Tzes, Fares Abu-Dakka,
- Abstract summary: We present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices.
- Score: 3.079859911926098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation for data collection. However, their reliance solely on first-person, wrist-mounted cameras often limits their ability to capture sufficient scene context. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that MV-UMI improves performance on sub-tasks requiring broad scene understanding by approximately 47% across three tasks, confirming that our approach expands the range of manipulation tasks that can be learned with handheld gripper systems without compromising their cross-embodiment advantages.
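To make the multi-view setup concrete, the sketch below shows one way a learned policy could consume both camera streams described in the abstract. It is a minimal PyTorch-style illustration only: the module names, small CNN encoders, feature sizes, concatenation-based fusion, and 7-dimensional action output are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiViewPolicy(nn.Module):
    """Illustrative policy that fuses an egocentric (wrist) and a third-person view."""

    def __init__(self, feat_dim=256, action_dim=7):
        super().__init__()
        # One visual encoder per camera stream; a real system would likely
        # use a pretrained backbone instead of this tiny CNN.
        self.wrist_encoder = self._make_encoder(feat_dim)
        self.scene_encoder = self._make_encoder(feat_dim)
        # Simple MLP head mapping the fused features to an end-effector action.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    @staticmethod
    def _make_encoder(feat_dim):
        return nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )

    def forward(self, wrist_img, scene_img):
        # Encode each view independently, then fuse by feature concatenation.
        z = torch.cat(
            [self.wrist_encoder(wrist_img), self.scene_encoder(scene_img)], dim=-1
        )
        return self.head(z)


if __name__ == "__main__":
    policy = MultiViewPolicy()
    wrist = torch.randn(1, 3, 96, 96)   # egocentric wrist-camera frame
    scene = torch.randn(1, 3, 96, 96)   # third-person scene-camera frame
    print(policy(wrist, scene).shape)   # torch.Size([1, 7])
```

The intent of the fusion, per the abstract, is that the third-person stream supplies the broad scene context a wrist camera alone misses, while the shared interface keeps observations comparable between human-collected demonstrations and robot deployment.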
Related papers
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle heterogeneous robotic data. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and more robust generalization.
arXiv Detail & Related papers (2025-12-05T13:21:05Z)
- End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection [10.217810309422232]
We propose a framework that divides control between macro and micro motions. A human operator guides the robot's arm pose through intuitive VR teleoperation. An autonomous DexGrasp-VLA policy handles fine-grained hand control using real-time tactile and visual feedback.
arXiv Detail & Related papers (2025-10-31T16:12:02Z)
- StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation [56.996371714721995]
We propose an unsupervised approach that learns a highly compressed two-token state representation. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation.
arXiv Detail & Related papers (2025-10-06T17:37:24Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Dynamic Non-Prehensile Object Transport via Model-Predictive Reinforcement Learning [24.079032278280447]
We propose an approach that combines batch reinforcement learning (RL) with model-predictive control (MPC). We validate the proposed approach through extensive simulated and real-world experiments on a Franka Panda robot performing the robot waiter task.
arXiv Detail & Related papers (2024-11-27T03:33:42Z)
- Learning Generalizable 3D Manipulation With 10 Demonstrations [16.502781729164973]
We present a novel framework that learns manipulation skills from as few as 10 demonstrations.
We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems.
This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.
arXiv Detail & Related papers (2024-11-15T14:01:02Z)
- Whole-Body Teleoperation for Mobile Manipulation at Zero Added Cost [8.71539730969424]
MoMa-Teleop is a novel teleoperation method that infers end-effector motions from existing interfaces. We demonstrate that our approach results in a significant reduction in task completion time across a variety of robots and tasks.
arXiv Detail & Related papers (2024-09-23T15:09:45Z)
- Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)