Related papers: A Unified Framework for 3D Scene Understanding

URL: http://arxiv.org/abs/2407.03263v1
Date: Wed, 3 Jul 2024 16:50:07 GMT
Title: A Unified Framework for 3D Scene Understanding
Authors: Wei Xu, Chunsheng Shi, Sifan Tu, Xin Zhou, Dingkang Liang, Xiang Bai
Abstract summary: UniSeg3D is a unified 3D segmentation framework that performs panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation within a single model. It facilitates inter-task knowledge sharing and promotes comprehensive 3D scene understanding. Experiments on three benchmarks (ScanNet20, ScanRefer, and ScanNet200) demonstrate that UniSeg3D consistently outperforms current SOTA methods.
Score: 50.6762892022386
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose UniSeg3D, a unified 3D segmentation framework that achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation tasks within a single model. Most previous 3D segmentation approaches are specialized for a specific task, thereby limiting their understanding of 3D scenes to a task-specific perspective. In contrast, the proposed method unifies six tasks into unified representations processed by the same Transformer. It facilitates inter-task knowledge sharing and, therefore, promotes comprehensive 3D scene understanding. To take advantage of multi-task unification, we enhance the performance by leveraging task connections. Specifically, we design a knowledge distillation method and a contrastive learning method to transfer task-specific knowledge across different tasks. Benefiting from extensive inter-task knowledge sharing, our UniSeg3D becomes more powerful. Experiments on three benchmarks, including the ScanNet20, ScanRefer, and ScanNet200, demonstrate that the UniSeg3D consistently outperforms current SOTA methods, even those specialized for individual tasks. We hope UniSeg3D can serve as a solid unified baseline and inspire future work. The code will be available at https://dk-liang.github.io/UniSeg3D/.

Related papers

Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z)
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks. RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model. Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
Uni3DL: Unified Model for 3D and Language Understanding [41.74095171149082]
We present Uni3DL, a unified model for 3D and Language understanding. Uni3DL operates directly on point clouds. It has been rigorously evaluated across diverse 3D vision-language understanding tasks.
arXiv Detail & Related papers (2023-12-05T08:30:27Z)
A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors [10.497309421830671]
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner. This work presents a general and simple framework to tackle point cloud understanding when labels are limited.
arXiv Detail & Related papers (2023-12-03T02:51:54Z)
Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space. We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z)
A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks. These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
Joint 2D-3D Multi-Task Learning on Cityscapes-3D: 3D Detection, Segmentation, and Depth Estimation [11.608682595506354]
TaskPrompter presents an innovative multi-task prompting framework. It unifies the learning of (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task interactions. The new benchmark requires the multi-task model to concurrently generate predictions for monocular 3D vehicle detection, semantic segmentation, and monocular depth estimation.
arXiv Detail & Related papers (2023-04-03T13:41:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.