MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
- URL: http://arxiv.org/abs/2510.18316v1
- Date: Tue, 21 Oct 2025 05:56:47 GMT
- Title: MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
- Authors: Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, Weiyu Liu, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
- Abstract summary: We introduce MoMaGen, which formulates data generation as a constrained optimization problem. We show it generates significantly more diverse datasets than existing methods. MoMaGen can train successful imitation learning policies from a single source demonstration.
- Score: 37.870170020889994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: momagen.github.io.
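The hard/soft split described in the abstract can be illustrated with a small toy sketch. This is not MoMaGen's implementation: the function names, the 0.9 m reach limit, the sampling window, and the angle-based visibility cost are all assumptions made for illustration. It rejects sampled base poses that violate a hard reachability constraint and, among the survivors, keeps the one with the lowest soft visibility cost.

```python
import math
import random

def reachable(base_xy, target_xy, max_reach=0.9):
    """Hard constraint (hypothetical): target lies within max_reach metres of the base."""
    return math.dist(base_xy, target_xy) <= max_reach

def visibility_cost(base_xy, base_yaw, target_xy):
    """Soft constraint (hypothetical): absolute angle between heading and target bearing."""
    bearing = math.atan2(target_xy[1] - base_xy[1], target_xy[0] - base_xy[0])
    # Wrap the angular difference into [-pi, pi) before taking its magnitude.
    diff = (bearing - base_yaw + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

def sample_base_pose(target_xy, n_samples=200, seed=0):
    """Rejection-sample base poses; return the feasible pose with the lowest soft cost."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        x = target_xy[0] + rng.uniform(-1.5, 1.5)
        y = target_xy[1] + rng.uniform(-1.5, 1.5)
        yaw = rng.uniform(-math.pi, math.pi)
        if not reachable((x, y), target_xy):
            continue  # hard constraint violated: discard the sample
        cost = visibility_cost((x, y), yaw, target_xy)
        if best is None or cost < best[0]:
            best = (cost, (x, y, yaw))
    return None if best is None else best[1]

if __name__ == "__main__":
    print(sample_base_pose((2.0, 1.0)))
```

Hard constraints act as a filter and are never traded off, while the soft cost only ranks the feasible samples; a real system would presumably add collision checking and motion planning on top.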
Related papers
- Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping [44.348686148716894]
We introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement.
arXiv Detail & Related papers (2026-03-03T18:59:07Z)
- Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos [56.510263910611684]
We tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors. We present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data.
arXiv Detail & Related papers (2026-02-13T18:59:10Z)
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse robotic data with heterogeneity. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalizations.
arXiv Detail & Related papers (2025-12-05T13:21:05Z)
- AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation [27.07279683330287]
AIRoA MoMa is a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states. The initial dataset comprises 25,469 episodes collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format.
arXiv Detail & Related papers (2025-09-29T16:51:47Z)
- MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning [3.079859911926098]
We present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices.
arXiv Detail & Related papers (2025-09-23T07:53:05Z)
- Manipulate-Anything: Automating Real-World Robots using Vision-Language Models [47.16659229389889]
We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation.
Manipulate-Anything can operate in real-world environments without privileged state information or hand-designed skills, and can manipulate any static object.
arXiv Detail & Related papers (2024-06-27T06:12:01Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation [54.31414116478024]
In mobile manipulation (MM), robots can both navigate within and interact with their environment.
In this work, we explore how to apply imitation learning (IL) to learn continuous visuo-motor policies for MM tasks.
arXiv Detail & Related papers (2021-12-09T23:54:59Z)
- Learning to Shift Attention for Motion Generation [55.61994201686024]
One challenge of motion generation using robot learning from demonstration techniques is that human demonstrations follow a distribution with multiple modes for one task query.
Previous approaches fail to capture all modes or tend to average modes of the demonstrations and thus generate invalid trajectories.
We propose a motion generation model with extrapolation ability to overcome this problem.
arXiv Detail & Related papers (2021-02-24T09:07:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.