Uni3DL: Unified Model for 3D and Language Understanding
- URL: http://arxiv.org/abs/2312.03026v1
- Date: Tue, 5 Dec 2023 08:30:27 GMT
- Title: Uni3DL: Unified Model for 3D and Language Understanding
- Authors: Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny
- Abstract summary: We present Uni3DL, a unified model for 3D and Language understanding.
Uni3DL operates directly on point clouds.
It has been rigorously evaluated across diverse 3D vision-language understanding tasks.
- Score: 41.74095171149082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present Uni3DL, a unified model for 3D and Language
understanding. Distinct from existing unified vision-language models in 3D
which are limited in task variety and predominantly dependent on projected
multi-view images, Uni3DL operates directly on point clouds. This approach
significantly expands the range of supported tasks in 3D, encompassing both
vision and vision-language tasks in 3D. At the core of Uni3DL, a query
transformer is designed to learn task-agnostic semantic and mask outputs by
attending to 3D visual features, and a task router is employed to selectively
generate task-specific outputs required for diverse tasks. With a unified
architecture, our Uni3DL model enjoys seamless task decomposition and
substantial parameter sharing across tasks. Uni3DL has been rigorously
evaluated across diverse 3D vision-language understanding tasks, including
semantic segmentation, object detection, instance segmentation, visual
grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates
performance on par with or surpassing state-of-the-art (SOTA) task-specific
models. We hope our benchmark and Uni3DL model will serve as a solid step to
ease future research in unified models in the realm of 3D and language
understanding. Project page: https://uni3dl.github.io.
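The abstract describes a decoder in which a query transformer attends to 3D visual features and a task router selectively activates task-specific outputs. The sketch below is an illustrative reconstruction of that idea only, not the released Uni3DL code: all module names, dimensions, output heads, and the task-to-head mapping are assumptions made for this example.
```python
# Illustrative sketch (assumed, not the authors' implementation): learnable
# queries cross-attend to per-point features from a 3D backbone, and a simple
# "task router" runs only the output heads the requested task needs.
import torch
import torch.nn as nn

class QueryDecoderWithRouter(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=20, vocab_size=1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Lightweight task-specific heads; everything above them is shared.
        self.heads = nn.ModuleDict({
            "semantic": nn.Linear(d_model, num_classes),  # per-query class logits
            "mask": nn.Linear(d_model, d_model),          # query embeddings for mask dot-products
            "text": nn.Linear(d_model, vocab_size),       # toy text/captioning head
        })

    def forward(self, point_feats, task):
        # point_feats: (B, N, d_model) features from a point-cloud backbone.
        B = point_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, Q, d)
        q, _ = self.cross_attn(q, point_feats, point_feats)    # queries attend to 3D features
        out = {}
        # Task routing: only the heads relevant to the task are evaluated.
        if task in ("semantic_segmentation", "instance_segmentation"):
            out["class_logits"] = self.heads["semantic"](q)                      # (B, Q, C)
            out["mask_logits"] = torch.einsum(
                "bqd,bnd->bqn", self.heads["mask"](q), point_feats)              # (B, Q, N)
        if task in ("captioning", "retrieval"):
            out["text_logits"] = self.heads["text"](q)                           # (B, Q, V)
        return out

# Usage: route point features through the instance-segmentation heads.
feats = torch.randn(2, 4096, 256)
model = QueryDecoderWithRouter()
outputs = model(feats, task="instance_segmentation")
```
Routing at the head level is one way to realize the parameter sharing the abstract mentions: the backbone, queries, and cross-attention are reused by every task, while only small heads differ.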
Related papers
- A Unified Framework for 3D Scene Understanding [50.6762892022386]
UniSeg3D is a unified 3D segmentation framework that handles panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation within a single model.
It facilitates inter-task knowledge sharing and promotes comprehensive 3D scene understanding.
Experiments on three benchmarks, including ScanNet20, ScanRefer, and ScanNet200, demonstrate that UniSeg3D consistently outperforms current SOTA methods.
arXiv Detail & Related papers (2024-07-03T16:50:07Z)
- Unifying 3D Vision-Language Understanding via Promptable Queries [39.55438547712157]
PQ3D is a unified model for 3D vision-language (3D-VL) understanding.
It uses promptable queries to tackle a wide range of 3D-VL tasks.
Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks.
arXiv Detail & Related papers (2024-05-19T04:35:05Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts [30.571811801090224]
We introduce a comprehensive 3D instruction-following dataset called M3DBench.
It supports general multimodal instructions interleaved with text, images, 3D objects, and other visual prompts.
It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments.
arXiv Detail & Related papers (2023-12-17T16:53:30Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)