M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes
- URL: http://arxiv.org/abs/2410.06678v2
- Date: Tue, 15 Oct 2024 03:02:05 GMT
- Title: M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes
- Authors: Zeyu Zhang, Sixu Yan, Muzhi Han, Zaijin Wang, Xinggang Wang, Song-Chun Zhu, Hangxin Liu
- Abstract summary: We propose M3Bench, a new benchmark of whole-body motion generation for mobile manipulation tasks.
M3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives.
M3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker.
- Score: 66.44171200767839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose M^3Bench, a new benchmark of whole-body motion generation for mobile manipulation tasks. Given a 3D scene context, M^3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives, then generate coordinated whole-body motion trajectories for object rearrangement tasks. M^3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M^3BenchMaker. This automatic data generation tool produces coordinated whole-body motion trajectories from high-level task instructions, requiring only basic scene and robot information. Our benchmark incorporates various task splits to assess generalization across different dimensions and leverages realistic physics simulation for trajectory evaluation. Through extensive experimental analyses, we reveal that state-of-the-art models still struggle with coordinated base-arm motion while adhering to environment-context and task-specific constraints, highlighting the need to develop new models that address this gap. Through M^3Bench, we aim to facilitate future robotics research towards more adaptive and capable mobile manipulation in diverse, real-world environments.
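The abstract describes a pipeline of task specifications, expert whole-body trajectories, and physics-based trajectory evaluation. Below is a minimal, hypothetical sketch of how such a rearrangement task and its rollout check might be represented; every class name, field, and the `simulator` interface is an illustrative assumption, not the actual M^3Bench or M^3BenchMaker API.

```python
"""Hypothetical sketch of an M^3Bench-style rearrangement task and its
physics-based evaluation loop. Names and fields are illustrative only."""

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class RearrangementTask:
    """One object-rearrangement task: a scene, a robot, and a goal placement."""
    scene_id: str           # one of the benchmark's 3D scenes
    robot_urdf: str         # basic robot description (mobile base + arm)
    target_object: str      # object to be rearranged
    goal_pose: np.ndarray   # desired 4x4 object pose in the scene frame
    split: str              # task split used to probe generalization


def evaluate_trajectory(task: RearrangementTask,
                        trajectory: List[np.ndarray],
                        simulator) -> bool:
    """Roll out a whole-body trajectory (base + arm configuration per step)
    in a physics simulator and check that the object reaches its goal pose
    without collisions. `simulator` is an assumed interface, not a real API."""
    simulator.load(task.scene_id, task.robot_urdf)
    for q in trajectory:    # q: concatenated base and arm configuration
        simulator.set_configuration(q)
        if simulator.in_collision():
            return False
    return simulator.object_at_pose(task.target_object, task.goal_pose)
```

A split-aware loop over such tasks would then report success rates per split, mirroring the benchmark's goal of measuring generalization across scenes, objects, and task variations.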
Related papers
- Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy [68.50785963043161]
GemBench is a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies.
We present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs.
3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
arXiv Detail & Related papers (2024-10-02T09:02:34Z)
- Human-Object Interaction from Human-Level Instructions [16.70362477046958]
We present the first complete system that can synthesize object motion, full-body motion, and finger motion simultaneously from human-level instructions.
Our experiments demonstrate the effectiveness of our high-level planner in generating plausible target layouts and our low-level motion generator in synthesizing realistic interactions for diverse objects.
arXiv Detail & Related papers (2024-06-25T17:46:28Z) - Closed Loop Interactive Embodied Reasoning for Robot Manipulation [17.732550906162192]
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks.
We introduce a new simulation environment that combines the MuJoCo physics engine with high-quality Blender rendering.
We propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements.
arXiv Detail & Related papers (2024-04-23T16:33:28Z) - Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
LMM tackles the challenges of unified multi-modal motion generation from three principled aspects.
arXiv Detail & Related papers (2024-04-01T17:55:11Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle this challenge.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI [96.86091264553613]
We introduce a visually-guided and physics-driven task-and-motion planning benchmark, which we call the ThreeDWorld Transport Challenge.
In this challenge, an embodied agent equipped with two 9-DOF articulated arms is spawned randomly in a simulated physical home environment.
The agent is required to find a small set of objects scattered around the house, pick them up, and transport them to a desired final location.
arXiv Detail & Related papers (2021-03-25T17:59:08Z)