Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation
- URL: http://arxiv.org/abs/2502.04139v1
- Date: Thu, 06 Feb 2025 15:19:48 GMT
- Title: Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation
- Authors: Jiahao Lu, Jiacheng Deng, Tianzhu Zhang
- Abstract summary: 3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels.
Transformer-based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance.
- Abstract: 3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, transformer-based methods fail to simultaneously maintain strong position and content information during query initialization. Additionally, due to supervision at each decoder layer, there exists a phenomenon of object disappearance with the deepening of layers. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent-Interpolation Initialization Module is designed to generate resilient queries capable of achieving a balance between foreground coverage and content learning. Additionally, a Hierarchical Query Fusion Decoder is designed to retain low overlap queries, mitigating the decrease in recall with the deepening of layers. Extensive experiments on ScanNetV2, ScanNet200, ScanNet++ and S3DIS datasets demonstrate the superior performance of BFL.
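The Hierarchical Query Fusion Decoder described in the abstract retains low-overlap queries across decoder layers so that objects found at an earlier layer are not lost at a later one. The paper does not publish this pseudocode here, so the sketch below is a hypothetical reconstruction of that retention step using mask IoU: the function names (`mask_iou`, `fuse_queries`) and the IoU threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mask_iou(a, b):
    """Pairwise IoU between two sets of binary masks, shapes (Qa, N) and (Qb, N)."""
    inter = a.astype(float) @ b.T.astype(float)           # (Qa, Qb) intersections
    union = a.sum(1, keepdims=True) + b.sum(1) - inter    # inclusion-exclusion
    return inter / np.maximum(union, 1e-6)

def fuse_queries(prev_masks, curr_masks, iou_thresh=0.5):
    """Keep previous-layer queries whose best overlap with the current layer's
    masks is below iou_thresh (they may cover objects the current layer dropped),
    and append them to the current query set. Threshold is an assumed value."""
    best = mask_iou(prev_masks, curr_masks).max(axis=1)   # (Q_prev,)
    keep = best < iou_thresh
    return np.concatenate([curr_masks, prev_masks[keep]], axis=0)
```

In this toy form, a previous-layer mask that matches nothing in the current layer (IoU below the threshold) survives into the fused set, which is the mechanism the abstract credits with mitigating the recall drop as decoder depth increases.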
Related papers
- SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation [14.214197948110115]
This paper introduces a novel method, named SGIFormer, for 3D instance segmentation.
It is composed of the Semantic-guided Mix Query (SMQ) and the Geometric-enhanced Interleaving Transformer (GIT) decoder.
It attains state-of-the-art performance on ScanNet V2, ScanNet200, and the challenging high-fidelity ScanNet++ benchmark.
arXiv Detail & Related papers (2024-07-16T10:17:28Z)
- Mask-Attention-Free Transformer for 3D Instance Segmentation [68.29828726317723]
Transformer-based methods have dominated 3D instance segmentation, where mask attention is commonly involved.
We develop a series of position-aware designs to overcome the low-recall issue and perform cross-attention by imposing positional prior.
Experiments show that our approach converges 4x faster than existing work, sets a new state of the art on ScanNetv2 3D instance segmentation benchmark, and also demonstrates superior performance across various datasets.
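The summary above mentions performing cross-attention "by imposing positional prior". One common way to realize such a prior, sketched here as an assumption rather than that paper's actual design, is to bias the attention logits with a Gaussian penalty on the 3D distance between each query's center and each point; the function name and the `sigma` parameter are illustrative.

```python
import numpy as np

def positional_cross_attention(q, k, v, q_pos, k_pos, sigma=1.0):
    """Cross-attention whose logits combine a content term with a Gaussian
    positional prior on the distance between query centers and point positions."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                          # (Q, N) content term
    dist2 = ((q_pos[:, None, :] - k_pos[None, :, :]) ** 2).sum(-1)
    logits = logits - dist2 / (2 * sigma ** 2)             # positional prior
    w = np.exp(logits - logits.max(-1, keepdims=True))     # stable softmax
    w = w / w.sum(-1, keepdims=True)
    return w @ v
```

With a small `sigma`, each query attends almost exclusively to points near its own center, which is the intuition behind using positional priors to combat the low-recall issue that mask attention can suffer from.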
arXiv Detail & Related papers (2023-09-04T16:09:28Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, we address the challenge of distinguishing the positive query from other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- OcTr: Octree-based Transformer for 3D Object Detection [30.335788698814444]
A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large scale 3D scenes.
We propose an Octree-based Transformer, named OcTr, to address this issue.
For enhanced foreground perception, we propose a hybrid positional embedding, composed of the semantic-aware positional embedding and attention mask.
arXiv Detail & Related papers (2023-03-22T15:01:20Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Learning Local Displacements for Point Cloud Completion [93.54286830844134]
We propose a novel approach aimed at object and semantic scene completion from a partial scan represented as a 3D point cloud.
Our architecture relies on three novel layers that are used successively within an encoder-decoder structure.
We evaluate both architectures on object and indoor scene completion tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-03-30T18:31:37Z)
- TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [18.910883028990998]
We propose a triplet transformer embedding module to enhance multi-level features.
It is the first to use three transformer encoders with shared weights to enhance multi-level features.
The proposed triplet transformer embedding network (TriTransNet) achieves the state-of-the-art performance in RGB-D salient object detection.
arXiv Detail & Related papers (2021-08-09T12:42:56Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution [136.7261709896713]
We propose a data-driven approach that generates the appropriate convolution kernels to apply in response to the nature of the instances.
The proposed method achieves promising results on both ScanNetV2 and S3DIS.
It also improves inference speed by more than 25% over the current state-of-the-art.
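DyCo3D's summary describes generating convolution kernels in response to the nature of each instance. A minimal sketch of that idea, assuming per-instance predicted weights applied as 1x1 "convolutions" (linear heads) over point features, is shown below; the function name, shapes, and thresholding at zero are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dynamic_conv_masks(point_feats, kernel_params, feat_dim):
    """Apply instance-specific linear heads whose weights are predicted per
    instance, yielding one binary mask per instance.
    point_feats: (N, C) point features; kernel_params: (I, C + 1) per-instance
    weights and bias, as would be produced by a kernel-prediction branch."""
    w = kernel_params[:, :feat_dim]           # (I, C) predicted weights
    b = kernel_params[:, feat_dim:]           # (I, 1) predicted biases
    logits = w @ point_feats.T + b            # (I, N) per-point mask logits
    return (logits > 0).astype(np.uint8)      # binary instance masks
```

The design choice here is that the kernels are data (network outputs) rather than learned static parameters, which is what lets each instance be segmented with a filter tailored to its own appearance.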
arXiv Detail & Related papers (2020-11-26T14:56:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.