FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset
- URL: http://arxiv.org/abs/2510.08022v1
- Date: Thu, 09 Oct 2025 09:57:25 GMT
- Title: FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset
- Authors: Kehui Liu, Zhongjie Jia, Yang Li, Zhaxizhuoma, Pengan Chen, Song Liu, Xin Liu, Pingrui Zhang, Haoming Song, Xinyi Ye, Nieqing Cao, Zhigang Wang, Jia Zeng, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
- Abstract summary: We present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset. FastUMI-100K offers a more scalable, flexible, and adaptable solution to fulfill the diverse requirements of real-world robot demonstration data. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images, and textual annotations.
- Score: 55.66606167502093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-driven robotic manipulation learning depends on large-scale, high-quality expert demonstration datasets. However, existing datasets, which primarily rely on human-teleoperated robot data collection, are limited in terms of scalability, trajectory smoothness, and applicability across different robotic embodiments in real-world environments. In this paper, we present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset, designed to overcome these limitations and meet the growing complexity of real-world manipulation tasks. Collected by FastUMI, a novel robotic system featuring a modular, hardware-decoupled mechanical design and an integrated lightweight tracking system, FastUMI-100K offers a more scalable, flexible, and adaptable solution to fulfill the diverse requirements of real-world robot demonstration data. Specifically, FastUMI-100K contains over 100K demonstration trajectories collected across representative household environments, covering 54 tasks and hundreds of object types. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images, and textual annotations. Each trajectory has a length ranging from 120 to 500 frames. Experimental results demonstrate that FastUMI-100K enables high policy success rates across various baseline algorithms, confirming its robustness, adaptability, and real-world applicability for solving complex, dynamic manipulation challenges. The source code and dataset will be released at https://github.com/MrKeee/FastUMI-100K.
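The abstract specifies the dataset's per-trajectory contents (end-effector states, multi-view wrist-mounted fisheye images, a textual annotation, 120 to 500 frames) but not a concrete on-disk format. Below is a minimal loading sketch, assuming one HDF5 file per trajectory; the path, dataset keys, state layout, and `instruction` attribute are hypothetical illustrations, not the released schema.

```python
import h5py
import numpy as np

# Hypothetical per-trajectory layout; the actual FastUMI-100K release may differ.
TRAJ_PATH = "fastumi_100k/task_pick_place/traj_00001.hdf5"

with h5py.File(TRAJ_PATH, "r") as f:
    # End-effector states: assume shape (T, 7) = xyz position + quaternion orientation.
    ee_states = np.asarray(f["ee_states"])
    # One of the multi-view wrist-mounted fisheye streams: assume (T, H, W, 3) uint8.
    wrist_rgb = np.asarray(f["images/wrist_fisheye_0"])
    # Free-form textual annotation describing the task.
    instruction = f.attrs.get("instruction", "")

# The abstract reports trajectory lengths between 120 and 500 frames.
assert 120 <= len(ee_states) <= 500
print(instruction, ee_states.shape, wrist_rgb.shape)
```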
Related papers
- Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents [57.59830804627066]
We introduce MONDAY, a large-scale dataset of 313K annotated frames from 20K instructional videos capturing real-world mobile OS navigation. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities. We present an automated framework that leverages publicly available video content to create comprehensive task datasets.
arXiv Detail & Related papers (2025-05-19T02:39:03Z)
- GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation [19.2804620329011]
The Generative Pretrained Multi-path Motion Model (GenM$^3$) is a comprehensive framework designed to learn unified motion representations. To enable large-scale training, we integrate and unify 11 high-quality motion datasets. GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing prior methods by a large margin.
arXiv Detail & Related papers (2025-03-19T05:56:52Z)
- Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction [5.989044517795631]
This paper presents the Kaiwu multimodal dataset to address the shortage of real-world synchronized multimodal data. The dataset provides an integrated human, environment, and robot data-collection framework with 20 subjects and 30 interaction objects. Fine-grained multi-level annotation based on absolute timestamps and semantic segmentation labelling are performed.
arXiv Detail & Related papers (2025-03-07T08:28:24Z)
- Physics-Driven Data Generation for Contact-Rich Manipulation via Trajectory Optimization [22.234170426206987]
We present a low-cost data generation pipeline that integrates physics-based simulation, human demonstrations, and model-based planning. We validate the pipeline's effectiveness by training diffusion policies for challenging contact-rich manipulation tasks. The trained policies are deployed zero-shot on hardware for bimanual iiwa arms, achieving high success rates with minimal human input.
arXiv Detail & Related papers (2025-02-27T18:56:01Z)
- RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation [47.41571121843972]
We introduce RoboMIND, a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotics-related information. Our dataset also includes 5k real-world failure demonstrations, each accompanied by detailed causes, enabling failure reflection and correction.
arXiv Detail & Related papers (2024-12-18T14:17:16Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark [63.878793340338035]
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras.
Existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting.
We present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments.
arXiv Detail & Related papers (2024-03-29T15:08:37Z)
- MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations [55.549956643032836]
MimicGen is a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations.
We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks.
arXiv Detail & Related papers (2023-10-26T17:17:31Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Scene-Aware Ambidextrous Bin Picking via Physics-based Metaverse Synthesis [72.85526892440251]
We introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis.
The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order, and ambidextrous grasp labels for parallel-jaw and vacuum grippers.
We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 difficulty levels and an unseen object set to evaluate different object and layout properties.
arXiv Detail & Related papers (2022-08-08T08:15:34Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance (one possible reading of this metric is sketched after this list).
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
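The MetaGraspNet entry above proposes a layout-weighted performance metric, but the summary does not define it. The sketch below is one plausible reading, assuming per-scene scores are averaged with weights given by each scene's layout difficulty class; the function name, layout classes, and weight values are all hypothetical illustrations, not the paper's formula.

```python
def layout_weighted_score(scene_scores, scene_layouts, layout_weights):
    """Weighted mean of per-scene scores, where each scene's weight comes from
    the difficulty class of its layout. A hypothetical reading of the metric;
    the paper's exact definition may differ."""
    num = den = 0.0
    for scene_id, score in scene_scores.items():
        w = layout_weights[scene_layouts[scene_id]]
        num += w * score
        den += w
    return num / den if den else 0.0

# Toy usage: two sparse scenes and one cluttered scene weighted 3x.
scores = {"s0": 0.9, "s1": 0.8, "s2": 0.5}
layouts = {"s0": "sparse", "s1": "sparse", "s2": "cluttered"}
weights = {"sparse": 1.0, "cluttered": 3.0}
print(layout_weighted_score(scores, layouts, weights))  # 0.64
```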