EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards
Embodied AI
- URL: http://arxiv.org/abs/2312.16170v1
- Date: Tue, 26 Dec 2023 18:59:11 GMT
- Title: EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards
Embodied AI
- Authors: Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen
Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua
Lin, Jiangmiao Pang
- Abstract summary: EmbodiedScan is a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding.
It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories.
Building upon this database, we introduce a baseline framework named Embodied Perceptron.
It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities.
- Score: 88.03089807278188
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the realm of computer vision and robotics, embodied agents are expected to
explore their environment and carry out human instructions. This necessitates
the ability to fully understand 3D scenes given their first-person observations
and contextualize them into language for interaction. However, traditional
research focuses more on scene-level input and output setups from a global
view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric
3D perception dataset and benchmark for holistic 3D scene understanding. It
encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language
prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which
partially align with LVIS, and dense semantic occupancy with 80 common
categories. Building upon this database, we introduce a baseline framework
named Embodied Perceptron. It is capable of processing an arbitrary number of
multi-modal inputs and demonstrates remarkable 3D perception capabilities, both
within the two series of benchmarks we set up, i.e., fundamental 3D perception
tasks and language-grounded tasks, and in the wild. Codes, datasets, and
benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.
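The claim that Embodied Perceptron processes "an arbitrary number of multi-modal inputs" comes down to a view-count-agnostic encoder. The sketch below is a minimal, hypothetical illustration of that idea, not the actual Embodied Perceptron architecture or the EmbodiedScan API; each ego-centric RGB-D view is encoded by a shared backbone and per-view features are merged with a permutation-invariant pooling, so the same weights serve one view or many. All class and variable names are invented for illustration.
```python
# Minimal sketch (assumed, not the released Embodied Perceptron): a shared
# per-view RGB-D encoder plus permutation-invariant pooling over views.
import torch
import torch.nn as nn


class MultiViewRGBDEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared per-view backbone over 4 channels (RGB + depth).
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: a list of (4, H, W) tensors; the list length is arbitrary.
        feats = torch.stack([self.backbone(v.unsqueeze(0)).squeeze(0) for v in views])
        # Max-pool across views: permutation invariant and view-count agnostic.
        return feats.max(dim=0).values


if __name__ == "__main__":
    model = MultiViewRGBDEncoder()
    # Three hypothetical ego-centric RGB-D frames of different resolutions.
    frames = [torch.rand(4, 120, 160), torch.rand(4, 240, 320), torch.rand(4, 96, 128)]
    print(model(frames).shape)  # torch.Size([128])
```
A real multi-view 3D perception model would typically also consume camera intrinsics and extrinsics to lift features into a shared 3D frame; the global pooling here only illustrates how variable-length multi-view input can be handled.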
Related papers
- 3D Feature Distillation with Object-Centric Priors [9.626027459292926]
2D vision-language models such as CLIP have been widely adopted due to their impressive open-vocabulary grounding capabilities in 2D images.
Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific or focus on indoor room scan data.
We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency.
arXiv Detail & Related papers (2024-06-26T20:16:49Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the first and largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly (a toy voxel-occupancy sketch appears after this list).
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework for 3D scene understanding.
We propose a novel way to use a Large Visual Language Model (VLM) by actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLM's ability to identify the most informative viewpoints.
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding [37.47195477043883]
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents.
We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes.
We demonstrate that this scaling enables a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning.
arXiv Detail & Related papers (2024-01-17T17:04:35Z)
- 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
This dataset consists of approximately 5k scenes and 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
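Several of the entries above revolve around dense voxel grids: EmbodiedScan's dense semantic occupancy annotations, VER's voxelized environment representation, and SurroundOcc's occupancy prediction. The snippet below is a toy numpy sketch of how a labelled point cloud can be rasterized into a semantic occupancy grid; the voxel size, grid shape, and label conventions are illustrative assumptions, not the specifications of any of these datasets or methods.
```python
# Toy sketch (assumed conventions): rasterize a labelled point cloud into a
# dense semantic occupancy grid, one category id per voxel.
import numpy as np


def voxelize_semantic_occupancy(points, labels, origin, voxel_size, grid_shape, free_label=0):
    """points: (N, 3) xyz in metres; labels: (N,) integer category ids (> 0)."""
    grid = np.full(grid_shape, free_label, dtype=np.int64)
    # Map each point to an integer voxel index relative to the grid origin.
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Keep only points that fall inside the grid bounds.
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, labels = idx[in_bounds], labels[in_bounds]
    # Later points overwrite earlier ones that land in the same voxel.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(0.0, 4.0, size=(10_000, 3))   # toy 4 m x 4 m x 4 m room
    lbl = rng.integers(1, 81, size=10_000)          # 80 hypothetical categories
    occ = voxelize_semantic_occupancy(pts, lbl, origin=np.zeros(3),
                                      voxel_size=0.2, grid_shape=(20, 20, 20))
    print(occ.shape, float((occ > 0).mean()))       # grid shape and occupied fraction
```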