OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data
- URL: http://arxiv.org/abs/2310.13398v1
- Date: Fri, 20 Oct 2023 10:12:18 GMT
- Title: OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data
- Authors: Yijie Zhou, Likun Cai, Xianhui Cheng, Zhongxue Gan, Xiangyang Xue, and Wenchao Ding
- Abstract summary: We introduce OpenAnnotate3D, an open-source open-vocabulary auto-labeling system for vision and point cloud data.
Our system integrates the chain-of-thought capabilities of Large Language Models and the cross-modality capabilities of vision-language models.
- Score: 42.37939270236269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of big data and large models, automatic annotation functions for multi-modal data are of great significance for real-world AI-driven applications, such as autonomous driving and embodied AI. Unlike traditional closed-set annotation, open-vocabulary annotation is essential to achieve human-level cognition capability. However, there are few open-vocabulary auto-labeling systems for multi-modal 3D data. In this paper, we introduce OpenAnnotate3D, an open-source open-vocabulary auto-labeling system that can automatically generate 2D masks, 3D masks, and 3D bounding box annotations for vision and point cloud data. Our system integrates the chain-of-thought capabilities of Large Language Models (LLMs) and the cross-modality capabilities of vision-language models (VLMs). To the best of our knowledge, OpenAnnotate3D is one of the pioneering works for open-vocabulary multi-modal 3D auto-labeling. We conduct comprehensive evaluations on both public and in-house real-world datasets, which demonstrate that the system significantly improves annotation efficiency compared to manual annotation while providing accurate open-vocabulary auto-annotation results.
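The geometric core of the pipeline the abstract describes can be pictured concretely. Below is a minimal sketch, not the authors' implementation: it assumes an LLM/VLM front end has already turned an open-vocabulary prompt into a 2D instance mask, and it lifts that mask to a 3D point mask and a 3D bounding box via standard LiDAR-camera projection. All function and variable names are illustrative.

```python
import numpy as np

def project_points(points_xyz, K, T_cam_from_lidar):
    """Project Nx3 LiDAR points into the image plane; return pixel coords and depth."""
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # Nx4 homogeneous
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]                     # Nx3, camera frame
    depth = cam[:, 2]
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return uv, depth

def lift_mask_to_3d(points_xyz, mask_2d, K, T_cam_from_lidar):
    """Select LiDAR points whose projection falls inside a VLM-predicted 2D mask."""
    uv, depth = project_points(points_xyz, K, T_cam_from_lidar)
    h, w = mask_2d.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    selected = np.zeros(len(points_xyz), dtype=bool)
    selected[valid] = mask_2d[v[valid], u[valid]]
    return selected  # the 3D mask: one boolean flag per LiDAR point

def fit_aabb(points_xyz):
    """Fit an axis-aligned 3D box (min/max corners) to the masked points."""
    return points_xyz.min(axis=0), points_xyz.max(axis=0)

# Toy usage: 1000 random points in front of the camera, one central 2D mask.
points = np.random.uniform([-5.0, -5.0, 2.0], [5.0, 5.0, 20.0], size=(1000, 3))
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)                  # assume camera and LiDAR frames coincide for this toy
mask = np.zeros((480, 640), dtype=bool)
mask[120:360, 160:480] = True  # stand-in for a VLM-predicted instance mask
mask_3d = lift_mask_to_3d(points, mask, K, T)
if mask_3d.any():
    box_min, box_max = fit_aabb(points[mask_3d])
```

A real system would replace the synthetic inputs with VLM-predicted masks and calibrated sensor parameters, and would typically fit oriented rather than axis-aligned boxes.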
Related papers
- OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection [47.9080685468069]
We introduce OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection.
OpenAD is built on a corner case discovery and annotation pipeline that integrates with a multimodal large language model (MLLM).
arXiv Detail & Related papers (2024-11-26T01:50:06Z)
- Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking [73.05477052645885]
We introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories.
We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes.
arXiv Detail & Related papers (2024-10-02T15:48:42Z)
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields [50.68719394443926]
Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-04-01T05:19:50Z)
- UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation [46.998093729036334]
We propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D.
To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module.
To facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the use of hierarchical 3D caption pairs.
arXiv Detail & Related papers (2024-01-21T04:13:58Z)
- Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving [39.70689418558153]
We present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels.
Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants.
arXiv Detail & Related papers (2023-09-25T19:33:52Z)
- Large-Vocabulary 3D Diffusion Model with Transformer [57.076986347047]
We introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model.
We propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, to handle these challenges from three aspects.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance.
arXiv Detail & Related papers (2023-09-14T17:59:53Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it requires no manual segmentation annotations for either the foundation models or the distillation process (a minimal distillation-loss sketch follows after this list).
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
- SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations (a pairwise self-attention sketch also follows after this list).
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
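For the weakly supervised segmentation entry above, the distillation step can be summarized as a feature-alignment loss: features rendered from the radiance field's feature head are pulled toward matching 2D foundation-model features, with no manual masks involved. A minimal sketch, assuming per-ray rendered features and pre-computed CLIP/DINO teacher features (tensor and function names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(rendered_feats: torch.Tensor,
                              teacher_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss aligning per-ray features rendered from a NeRF
    feature head with 2D foundation-model features of the same pixels."""
    r = F.normalize(rendered_feats, dim=-1)  # (num_rays, dim)
    t = F.normalize(teacher_feats, dim=-1)   # (num_rays, dim), teacher is frozen
    return (1.0 - (r * t).sum(dim=-1)).mean()

# Usage on dummy tensors: 4096 rays, 512-d CLIP-like features.
loss = feature_distillation_loss(torch.randn(4096, 512), torch.randn(4096, 512))
```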
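Similarly, the pairwise (full) self-attention variant from the SA-Det3D entry amounts to standard multi-head attention applied over a set of voxel, pillar, or point features; the deformable sampling variant is omitted here. A minimal PyTorch sketch (module and tensor names are illustrative):

```python
import torch
import torch.nn as nn

class PairwiseSelfAttention(nn.Module):
    """Full pairwise self-attention over a set of 3D feature vectors."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_elements, dim), one vector per voxel/pillar/point.
        ctx, _ = self.attn(feats, feats, feats)  # every element attends to all
        return self.norm(feats + ctx)            # residual + norm

# Usage: add global context to 256-d features of 1024 voxels.
out = PairwiseSelfAttention(dim=256)(torch.randn(2, 1024, 256))
```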