Related papers: Unifying 3D Vision-Language Understanding via Promptable Queries

URL: http://arxiv.org/abs/2405.11442v2
Date: Wed, 24 Jul 2024 07:31:37 GMT
Title: Unifying 3D Vision-Language Understanding via Promptable Queries
Authors: Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
Abstract summary: PQ3D is a unified model for 3D vision-language (3D-VL) understanding that uses Promptable Queries to tackle a wide range of 3D-VL tasks. Tested across ten diverse 3D-VL datasets, PQ3D sets new records on most benchmarks.
Score: 39.55438547712157
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.

Related papers

3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering [52.01655676571933]
3D-MoRe is designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer pairs and 73,000 object descriptions.
arXiv Detail & Related papers (2025-07-16T08:38:26Z)
3D Question Answering via only 2D Vision-Language Models [87.41421075243103]
Large vision-language models (LVLMs) have advanced numerous fields. We explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA.
arXiv Detail & Related papers (2025-05-28T09:04:39Z)
TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment [14.535056813802527]
Testing-time Distribution Alignment (TeDA) is a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings. Experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-05T02:47:07Z)
SplatTalk: 3D VQA with Gaussian Splatting [13.211810095081159]
Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction. We introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM.
arXiv Detail & Related papers (2025-03-08T16:31:48Z)
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [19.382210260928776]
Video-3D LLM treats 3D scenes as dynamic videos and incorporates 3D position encoding into these representations. Our model achieves state-of-the-art performance on several 3D scene understanding benchmarks.
arXiv Detail & Related papers (2024-11-30T14:28:53Z)
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
Uni3DL: Unified Model for 3D and Language Understanding [41.74095171149082]
We present Uni3DL, a unified model for 3D and Language understanding. Uni3DL operates directly on point clouds. It has been rigorously evaluated across diverse 3D vision-language understanding tasks.
arXiv Detail & Related papers (2023-12-05T08:30:27Z)
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [44.00343134325925]
3D-VisTA is a pre-trained Transformer for 3D Vision and Text Alignment. ScanScribe is the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training.
arXiv Detail & Related papers (2023-08-08T15:59:17Z)
3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.