SAT: Size-Aware Transformer for 3D Point Cloud Semantic Segmentation
- URL: http://arxiv.org/abs/2301.06869v1
- Date: Tue, 17 Jan 2023 13:25:11 GMT
- Title: SAT: Size-Aware Transformer for 3D Point Cloud Semantic Segmentation
- Authors: Junjie Zhou, Yongping Xiong, Chinwai Chiu, Fangyu Liu, Xiangyang Gong
- Abstract summary: We propose the Size-Aware Transformer (SAT) that can tailor effective receptive fields for objects of different sizes.
Our SAT achieves size-aware learning in two steps: introducing multi-scale features into each attention layer and allowing each point to choose its attentive fields adaptively.
- Score: 6.308766374923878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have achieved promising performance in point cloud
segmentation. However, most existing attention schemes provide the same feature
learning paradigm for all points equally and overlook the enormous difference
in size among scene objects. In this paper, we propose the Size-Aware
Transformer (SAT) that can tailor effective receptive fields for objects of
different sizes. Our SAT achieves size-aware learning in two steps: introducing
multi-scale features into each attention layer and allowing each point to choose
its attentive fields adaptively. It contains two key designs: the Multi-Granularity
Attention (MGA) scheme and the Re-Attention module. The MGA addresses two
challenges: efficiently aggregating tokens from distant areas and preserving
multi-scale features within one attention layer. Specifically, point-voxel
cross attention is proposed to address the first challenge, and the shunted
strategy based on the standard multi-head self attention is applied to solve
the second. The Re-Attention module dynamically adjusts the attention scores to
the fine- and coarse-grained features output by MGA for each point. Extensive
experimental results demonstrate that SAT achieves state-of-the-art
performance on the S3DIS and ScanNetV2 datasets. Our SAT also achieves the most
balanced per-category performance among all compared methods, which
illustrates the advantage of modelling categories of different sizes. Our
code and model will be released after the acceptance of this paper.
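The abstract describes the Re-Attention module only at a high level, so a compact sketch may help. The following hypothetical PyTorch snippet (module and variable names are ours, not the authors') shows one plausible form of per-point re-weighting between the fine- and coarse-grained outputs of a multi-granularity attention layer: a small gating MLP predicts, for every point, how to mix the two granularities, which is how each point could "choose its attentive fields adaptively".

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Hypothetical sketch of per-point re-weighting between fine- and
    coarse-grained attention outputs; not the paper's released code."""

    def __init__(self, dim: int):
        super().__init__()
        # Small MLP that predicts, for every point, how much weight to put
        # on the fine-grained vs. the coarse-grained branch.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 2),
        )

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine, coarse: (N, dim) per-point features from two granularities.
        w = torch.softmax(self.gate(torch.cat([fine, coarse], dim=-1)), dim=-1)
        # Each point mixes the two granularities with its own weights,
        # effectively choosing its attentive field adaptively.
        return w[..., 0:1] * fine + w[..., 1:2] * coarse
```

In a full model, `fine` and `coarse` would come from the MGA branches (for example, point-level attention and point-voxel cross attention); here they are simply two (N, dim) tensors.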
Related papers
- DSU-Net: An Improved U-Net Model Based on DINOv2 and SAM2 with Multi-scale Cross-model Feature Enhancement [7.9006143460465355]
This paper proposes a multi-scale feature collaboration framework guided by DINOv2 for SAM2, with core innovations in three aspects.
It surpasses existing state-of-the-art methods in downstream tasks such as camouflage target detection and salient object detection, without requiring costly training processes.
arXiv Detail & Related papers (2025-03-27T06:08:24Z)
- Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network [80.19054069988559]
We find that self-supervised monocular depth estimation shows direction sensitivity and environmental dependency.
We propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth representation in two aspects.
Experiments show that our method achieves significant improvements on three widely used benchmarks.
arXiv Detail & Related papers (2023-08-10T14:32:18Z)
- IoU-Enhanced Attention for End-to-End Task Specific Object Detection [17.617133414432836]
Sparse R-CNN achieves promising results without densely tiled anchor boxes or grid points in the image.
Due to the sparse nature and the one-to-one relation between each query and its attending region, it depends heavily on self-attention.
This paper proposes to use the IoU between different boxes as a prior for value routing in self-attention.
arXiv Detail & Related papers (2022-09-21T14:36:18Z)
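As a rough illustration of the IoU-prior idea in the entry above, the sketch below biases standard scaled dot-product attention logits with the log-IoU between the queries' boxes, so each query routes values mainly from overlapping boxes. This is one plausible reading of "IoU as a prior for value routing"; the paper's exact formulation may differ, and the function names here are hypothetical.

```python
import torch

def iou_matrix(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: (N, 4) in (x1, y1, x2, y2) format; returns pairwise IoU, shape (N, N).
    area = (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])   # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])   # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area[:, None] + area[None, :] - inter + 1e-6)

def iou_biased_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         boxes: torch.Tensor) -> torch.Tensor:
    # Scaled dot-product attention over box queries with a log-IoU bias,
    # so each query mainly gathers values from queries whose boxes overlap its own.
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    logits = logits + torch.log(iou_matrix(boxes) + 1e-6)
    return torch.softmax(logits, dim=-1) @ v
```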
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we incorporate transformers into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
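The "sampling and grouping" cross attention in the CloudAttention entry above can be caricatured as follows: subsample a small set of anchor tokens and let every point cross-attend to them, which avoids the quadratic cost of full global attention. The snippet is an assumed sketch (uniform random sampling stands in for the usual farthest point sampling, and learned projections are omitted), not the paper's implementation.

```python
import torch

def global_cross_attention(x: torch.Tensor, num_anchors: int = 64) -> torch.Tensor:
    # x: (N, C) point features. Subsample a small anchor set (uniform random
    # here; farthest point sampling is the common choice) and let every point
    # cross-attend to the anchors, giving near-linear global feature mixing.
    n, c = x.shape
    idx = torch.randperm(n)[:num_anchors]
    anchors = x[idx]                                   # (M, C) keys/values
    logits = x @ anchors.transpose(0, 1) / c ** 0.5    # (N, M) attention logits
    return torch.softmax(logits, dim=-1) @ anchors     # (N, C) mixed features
```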
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
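One way to realize the cross-scale supervised constraint sketched in the entry above is a supervised InfoNCE-style loss between features from two scales that have been projected into a common space. The function below is an assumed simplification (the name and the exact positive/negative definition are ours, not the paper's).

```python
import torch
import torch.nn.functional as F

def cross_scale_contrastive_loss(f_low: torch.Tensor,
                                 f_high: torch.Tensor,
                                 labels: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    # f_low, f_high: (N, D) features from two scales, already projected into a
    # common space; labels: (N,) semantic class per sampled location.
    f_low = F.normalize(f_low, dim=-1)
    f_high = F.normalize(f_high, dim=-1)
    sim = f_low @ f_high.t() / temperature               # (N, N) cross-scale similarities
    pos = (labels[:, None] == labels[None, :]).float()   # same-class pairs are positives
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    # Mean log-probability of the positive pairs per anchor (supervised InfoNCE).
    per_anchor = (log_prob * pos).sum(-1) / pos.sum(-1).clamp(min=1.0)
    return -per_anchor.mean()
```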
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
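The shunted strategy in the entry above (also reused by SAT's MGA) can be sketched as heads that share queries but attend to keys/values at different token resolutions. The toy module below pools the keys/values of half of the heads; it is an illustration of the idea under that assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuntedAttentionSketch(nn.Module):
    """Toy sketch of shunted self-attention: half the heads attend to
    full-resolution keys/values, the other half to pooled (aggregated)
    tokens, so a single layer mixes fine and coarse scales."""

    def __init__(self, dim: int, heads: int = 4, pool: int = 4):
        super().__init__()
        assert heads % 2 == 0 and dim % heads == 0
        self.h, self.dh, self.r = heads, dim // heads, pool
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        # q: (h, N, dh); k, v: (h, M, dh) -> (h, N, dh)
        a = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
        return a @ v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(n, self.h, self.dh).transpose(0, 1)   # (h, N, dh)
        k = k.view(n, self.h, self.dh).transpose(0, 1)
        v = v.view(n, self.h, self.dh).transpose(0, 1)
        half = self.h // 2
        # Coarse heads: average-pool keys/values along the token axis.
        kc = F.avg_pool1d(k[half:].transpose(1, 2), self.r).transpose(1, 2)
        vc = F.avg_pool1d(v[half:].transpose(1, 2), self.r).transpose(1, 2)
        out_fine = self._attend(q[:half], k[:half], v[:half])
        out_coarse = self._attend(q[half:], kc, vc)
        out = torch.cat([out_fine, out_coarse], dim=0)   # (h, N, dh)
        return self.proj(out.transpose(0, 1).reshape(n, d))
```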
- Background-Aware 3D Point Cloud Segmentation with Dynamic Point Feature Aggregation [12.093182949686781]
We propose a novel 3D point cloud learning network, referred to as the Dynamic Point Feature Aggregation Network (DPFA-Net).
DPFA-Net has two variants for semantic segmentation and classification of 3D point clouds.
It achieves the state-of-the-art overall accuracy score for semantic segmentation on the S3DIS dataset.
arXiv Detail & Related papers (2021-11-14T05:46:05Z)
- DFNet: Discriminative feature extraction and integration network for salient object detection [6.959742268104327]
We focus on two challenges in saliency detection with Convolutional Neural Networks.
First, since salient objects appear at various sizes, a single-scale convolution cannot capture the right receptive field for every object.
Second, using multi-level features helps the model exploit both local and global context.
arXiv Detail & Related papers (2020-04-03T13:56:41Z)
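The multi-scale issue raised in the DFNet entry above is commonly addressed with parallel convolution branches at different dilation rates, so objects of different sizes fall within some branch's receptive field. The block below is a generic, hypothetical example of that pattern, not DFNet's actual design.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Illustrative multi-scale block: parallel 3x3 branches with different
    dilation rates, fused by a 1x1 convolution (hypothetical example)."""

    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4)):
        super().__init__()
        # One branch per dilation rate; padding = dilation keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations
        )
        self.fuse = nn.Conv2d(len(dilations) * out_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate responses from all receptive-field sizes, then fuse.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```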
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.