A Unified Query-based Paradigm for Point Cloud Understanding
- URL: http://arxiv.org/abs/2203.01252v2
- Date: Thu, 3 Mar 2022 07:49:12 GMT
- Title: A Unified Query-based Paradigm for Point Cloud Understanding
- Authors: Zetong Yang, Li Jiang, Yanan Sun, Bernt Schiele, Jiaya Jia
- Abstract summary: We present a novel Embedding-Querying paradigm (EQ-Paradigm) for 3D understanding tasks including detection, segmentation and classification.
The input is encoded in the embedding stage with an arbitrary feature extraction architecture, which is independent of tasks and heads.
This is achieved by introducing an intermediate representation, i.e., Q-representation, in the querying stage to serve as a bridge between the embedding stage and task heads.
- Score: 116.30071021894317
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D point cloud understanding is an important component in autonomous driving
and robotics. In this paper, we present a novel Embedding-Querying paradigm
(EQ-Paradigm) for 3D understanding tasks including detection, segmentation and
classification. EQ-Paradigm is a unified paradigm that enables the combination
of any existing 3D backbone architectures with different task heads. Under the
EQ-Paradigm, the input is firstly encoded in the embedding stage with an
arbitrary feature extraction architecture, which is independent of tasks and
heads. Then, the querying stage enables the encoded features to be applicable
for diverse task heads. This is achieved by introducing an intermediate
representation, i.e., Q-representation, in the querying stage to serve as a
bridge between the embedding stage and task heads. We design a novel Q-Net as
the querying stage network. Extensive experimental results on various 3D tasks
including semantic segmentation, object detection and shape classification show
that EQ-Paradigm in tandem with Q-Net is a general and effective pipeline,
which enables a flexible collaboration of backbones and heads, and further
boosts the performance of state-of-the-art methods. All code and models will be
published soon.
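The two-stage flow described in the abstract (task-agnostic embedding, then a querying stage that produces a Q-representation for arbitrary task heads) can be sketched in a few lines of NumPy. This is an illustrative toy only, not the paper's Q-Net: the real querying stage is a learned transformer-style network, whereas here a random linear map stands in for the backbone and distance-based soft attention stands in for Q-Net. All function names are hypothetical.

```python
import numpy as np

def embedding_stage(points, rng):
    # Stand-in for an arbitrary 3D backbone: a random linear projection of
    # xyz coordinates to 32-dim support features. (Hypothetical placeholder;
    # under the EQ-Paradigm any existing backbone could slot in here.)
    W = rng.standard_normal((3, 32))
    return points @ W  # (N, 32) support features

def querying_stage(query_xyz, support_xyz, support_feats):
    # Stand-in for Q-Net: each query position aggregates support features via
    # distance-based soft attention, yielding a task-agnostic Q-representation.
    d2 = ((query_xyz[:, None, :] - support_xyz[None, :, :]) ** 2).sum(-1)  # (Q, N)
    attn = np.exp(-d2)
    attn /= attn.sum(axis=1, keepdims=True)  # rows sum to 1
    return attn @ support_feats  # (Q, 32) Q-representation

rng = np.random.default_rng(0)
points = rng.standard_normal((1024, 3))   # input point cloud
feats = embedding_stage(points, rng)      # embedding stage (backbone-agnostic)

# Different task heads choose their own query positions: segmentation queries
# every input point, while detection might query a sparse set of candidates.
q_seg = querying_stage(points, points, feats)       # (1024, 32)
q_det = querying_stage(points[:16], points, feats)  # (16, 32)
```

The key point the sketch illustrates is the decoupling: the embedding stage never sees the task, and each head only consumes the Q-representation at its own query positions.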
Related papers
- Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and takes only a few seconds to infer the full 3D semantic geometry. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z) - OmniFD: A Unified Model for Versatile Face Forgery Detection [45.17431538516313]
We introduce OmniFD, a unified framework that jointly addresses four core forgery detection tasks within a single model. Our architecture consists of three principal components: (1) a shared Swin Transformer that extracts unified 4D-temporal representations from both image and video inputs, (2) a cross-task interaction module with learnable queries, and (3) lightweight decoding heads that transform refined representations into corresponding predictions.
arXiv Detail & Related papers (2025-11-30T22:36:42Z) - SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z) - 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D affordance detection is a challenging problem with broad applications to various robotic tasks. We reformulate the traditional affordance detection paradigm into an Instruction Reasoning Affordance Segmentation (IRAS) task. We propose 3D-ADLLM, a framework designed for reasoning affordance detection in open 3D scenes.
arXiv Detail & Related papers (2025-02-27T12:29:44Z) - UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation [9.275683880295874]
Scene Graph Generation (SGG) aims at identifying object entities and reasoning their relationships within a given image.
One-stage methods integrate a fixed-size set of learnable queries to jointly reason relational triplets.
The challenge in one-stage methods stems from the issue of weak entanglement.
We introduce UniQ, a Unified decoder with task-specific queries architecture.
arXiv Detail & Related papers (2025-01-10T03:38:16Z) - Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks [8.921189024320919]
We investigate the effectiveness of parameter-efficient fine-tuning (PEFT) of the Q-Former.
Applying PEFT to the Q-Former matches full fine-tuning while using under 2% of the trainable parameters.
Our findings reveal that the self-attention layers are noticeably more important in perceptual visual-language reasoning tasks.
arXiv Detail & Related papers (2024-10-12T10:51:05Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - A Unified Framework for 3D Scene Understanding [50.6762892022386]
UniSeg3D is a unified 3D segmentation framework that achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation tasks within a single model.
It facilitates inter-task knowledge sharing and promotes comprehensive 3D scene understanding.
Experiments on three benchmarks, ScanNet20, ScanRefer, and ScanNet200, demonstrate that UniSeg3D consistently outperforms current SOTA methods.
arXiv Detail & Related papers (2024-07-03T16:50:07Z) - DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is a challenging task that processes real-world scene point clouds.
Recent state-of-the-art methods first segment each object and then process the objects independently with multiple stages for the different sub-tasks.
We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z) - Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - OneFormer3D: One Transformer for Unified Point Cloud Segmentation [5.530212768657545]
This paper presents a unified, simple, and effective model addressing semantic, instance, and panoptic segmentation tasks jointly.
The model, named OneFormer3D, performs instance and semantic segmentation consistently, using a group of learnable kernels.
We also demonstrate the state-of-the-art results in semantic, instance, and panoptic segmentation of ScanNet, ScanNet200, and S3DIS datasets.
arXiv Detail & Related papers (2023-11-24T10:56:27Z) - Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, distinguishing the positive query from other highly similar queries that are not the best match poses a challenge for the network.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.