AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation
- URL: http://arxiv.org/abs/2509.25032v1
- Date: Mon, 29 Sep 2025 16:51:47 GMT
- Title: AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation
- Authors: Ryosuke Takanami, Petr Khrapchenkov, Shu Morikuni, Jumpei Arima, Yuta Takaba, Shunsuke Maeda, Takuya Okubo, Genki Sano, Satoshi Sekioka, Aoi Kadoya, Motonari Kambara, Naoya Nishiura, Haruto Suzuki, Takanori Yoshimoto, Koya Sakamoto, Shinnosuke Ono, Hu Yang, Daichi Yashima, Aoi Horo, Tomohiro Motoda, Kensuke Chiyoma, Hiroshi Ito, Koki Fukuda, Akihito Goto, Kazumi Morinaga, Yuya Ikeda, Riko Kawada, Masaki Yoshikawa, Norio Kosuge, Yuki Noguchi, Kei Ota, Tatsuya Matsushima, Yusuke Iwasawa, Yutaka Matsuo, Tetsuya Ogata
- Abstract summary: AIRoA MoMa is a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states. The initial dataset comprises 25,469 episodes collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format.
- Score: 27.07279683330287
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As robots transition from controlled settings to unstructured human environments, building generalist agents that can reliably follow natural language instructions remains a central challenge. Progress in robust mobile manipulation requires large-scale multimodal datasets that capture contact-rich and long-horizon tasks, yet existing resources lack synchronized force-torque sensing, hierarchical annotations, and explicit failure cases. We address this gap with the AIRoA MoMa Dataset, a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states, together with a novel two-layer annotation schema of sub-goals and primitive actions for hierarchical learning and error analysis. The initial dataset comprises 25,469 episodes (approx. 94 hours) collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format. By uniquely integrating mobile manipulation, contact-rich interaction, and long-horizon structure, AIRoA MoMa provides a critical benchmark for advancing the next generation of Vision-Language-Action models. The first version of our dataset is now available at https://huggingface.co/datasets/airoa-org/airoa-moma .
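The two-layer annotation schema described in the abstract (sub-goals, each broken into primitive actions, with explicit failure cases) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names, field names, and the example labels below are hypothetical and are not taken from the dataset or from the LeRobot format. The final lines work out the mean episode length implied by the abstract's own figures (25,469 episodes, approx. 94 hours).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PrimitiveAction:
    """Lower annotation layer: one short action segment (hypothetical fields)."""
    label: str
    start_s: float
    end_s: float
    success: bool = True


@dataclass
class SubGoal:
    """Upper annotation layer: a sub-goal made of primitive actions."""
    description: str
    primitives: List[PrimitiveAction] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # A sub-goal counts as successful only if every primitive succeeded.
        return all(p.success for p in self.primitives)


@dataclass
class Episode:
    """One episode: a language instruction plus its hierarchical annotations."""
    instruction: str
    sub_goals: List[SubGoal] = field(default_factory=list)

    def first_failure(self) -> Optional[Tuple[str, str]]:
        # Return (sub-goal, primitive) of the first failed step, or None.
        for sub_goal in self.sub_goals:
            for primitive in sub_goal.primitives:
                if not primitive.success:
                    return sub_goal.description, primitive.label
        return None


# Figures quoted in the abstract; the hour count is approximate.
NUM_EPISODES = 25_469
TOTAL_HOURS = 94.0
mean_episode_seconds = TOTAL_HOURS * 3600 / NUM_EPISODES

# Hypothetical example episode with one failed primitive.
example = Episode(
    instruction="Put the cup on the shelf",
    sub_goals=[
        SubGoal("Pick up the cup", [
            PrimitiveAction("reach", 0.0, 2.0),
            PrimitiveAction("grasp", 2.0, 3.5),
        ]),
        SubGoal("Place the cup on the shelf", [
            PrimitiveAction("navigate", 3.5, 8.0),
            PrimitiveAction("place", 8.0, 10.0, success=False),
        ]),
    ],
)
```

In this sketch, `example.first_failure()` locates the failed step at both annotation levels, which is the kind of error analysis the hierarchical labels are meant to support. The arithmetic gives a mean episode length of roughly 13 seconds.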
Related papers
- RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization [31.40401674436269]
We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. We collected one of the largest open-source robotic datasets, over 10,000 hours of demonstrations in diverse families, using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference.
arXiv Detail & Related papers (2026-02-03T09:38:23Z)
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse robotic data with heterogeneity. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalizations.
arXiv Detail & Related papers (2025-12-05T13:21:05Z)
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation [37.870170020889994]
We introduce MoMaGen, which formulates data generation as a constrained optimization problem. We show it generates significantly more diverse datasets than existing methods. MoMaGen can train successful imitation learning policies from a single source demonstration.
arXiv Detail & Related papers (2025-10-21T05:56:47Z)
- HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning [46.57163859424286]
This paper presents HumanoidGen, an automated task creation and demonstration collection framework. Specifically, we provide spatial annotations for both assets and dexterous hands based on the atomic operations. In experiments, we create a novel benchmark with augmented scenarios to evaluate the quality of the collected data.
arXiv Detail & Related papers (2025-07-01T15:04:38Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing of video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis [70.39500621448383]
Open-world mobile manipulation remains a challenge because it requires generalization to open-ended instructions and environments. We propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model.
arXiv Detail & Related papers (2025-06-04T17:57:44Z)
- Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents [57.59830804627066]
We introduce MONDAY, a large-scale dataset of 313K annotated frames from 20K instructional videos capturing real-world mobile OS navigation. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities. We present an automated framework that leverages publicly available video content to create comprehensive task datasets.
arXiv Detail & Related papers (2025-05-19T02:39:03Z)
- Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction [5.989044517795631]
This paper presents the Kaiwu multimodal dataset to address the lack of real-world synchronized multimodal data. The dataset provides an integrated human, environment, and robot data collection framework with 20 subjects and 30 interaction objects. Fine-grained multi-level annotation based on absolute timestamps and semantic segmentation labelling are performed.
arXiv Detail & Related papers (2025-03-07T08:28:24Z)
- Multi-Robot Deep Reinforcement Learning for Mobile Navigation [82.62621210336881]
We propose a deep reinforcement learning algorithm with hierarchically integrated models (HInt).
At training time, HInt learns separate perception and dynamics models, and at test time, HInt integrates the two models in a hierarchical manner and plans actions with the integrated model.
Our mobile navigation experiments show that HInt outperforms conventional hierarchical policies and single-source approaches.
arXiv Detail & Related papers (2021-06-24T19:07:40Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)