Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation
- URL: http://arxiv.org/abs/2401.17828v2
- Date: Mon, 11 Mar 2024 04:59:43 GMT
- Title: Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation
- Authors: Rozhan Ahmadi, Shohreh Kasaei
- Abstract summary: This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 outperforms state-of-the-art models, achieving 0.98% higher localization mAP.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
- Score: 12.103012959947055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, weakly supervised semantic segmentation using image-level
labels as supervision has received significant attention in the field of
computer vision. Most existing methods have addressed the challenges arising
from the lack of spatial information in these labels by focusing on
facilitating supervised learning through the generation of pseudo-labels from
class activation maps (CAMs). Due to the localized pattern detection of CNNs,
CAMs often emphasize only the most discriminative parts of an object, making it
challenging to accurately distinguish foreground objects from each other and
the background. Recent studies have shown that Vision Transformer (ViT)
features, due to their global view, are more effective in capturing the scene
layout than CNNs. However, the use of hierarchical ViTs has not been
extensively explored in this field. This work explores the use of Swin
Transformer by proposing "SWTformer" to enhance the accuracy of the initial
seed CAMs by bringing local and global views together. SWTformer-V1 generates
class probabilities and CAMs using only the patch tokens as features.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract
additional information and utilizes a background-aware mechanism to generate
more accurate localization maps with improved cross-object discrimination.
In experiments on the PascalVOC 2012 dataset, SWTformer-V1 outperforms
state-of-the-art models, achieving 0.98% higher localization mAP. It also
generates initial localization maps that are on average 0.82% mIoU better than
those of other methods, while relying only on the classification network.
SWTformer-V2 further improves the accuracy of the
generated seed CAMs by 5.32% mIoU, further proving the effectiveness of the
local-to-global view provided by the Swin Transformer. Code available at:
https://github.com/RozhanAhmadi/SWTformer
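As a rough illustration of the SWTformer-V1 idea (class probabilities and CAMs derived only from patch tokens), the sketch below shows the generic token-based CAM recipe: a per-token linear classifier scores each patch token, the scores are reshaped into per-class maps that serve as seed CAMs, and average pooling over the token grid yields image-level logits for multi-label training. The `TokenCAMHead` module, the 7x7 token grid, and the tensor shapes are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenCAMHead(nn.Module):
    """Per-token classifier producing image-level logits and coarse seed CAMs.

    Expects patch tokens of shape (B, N, D) from the last hierarchical stage,
    with N = H * W on a square token grid (e.g. 7 x 7 = 49 for a 224 input).
    """
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor):
        b, n, _ = tokens.shape
        h = w = int(n ** 0.5)                                      # square token grid assumed
        token_scores = self.classifier(tokens)                     # (B, N, C) per-token class evidence
        cams = token_scores.transpose(1, 2).reshape(b, -1, h, w)   # (B, C, H, W) class maps
        logits = cams.flatten(2).mean(dim=2)                       # global average pooling over tokens
        return logits, F.relu(cams)                                # positive activations form the seed CAM


if __name__ == "__main__":
    # Toy run: 49 random "stage-4" tokens of width 768, 20 Pascal VOC classes.
    head = TokenCAMHead(embed_dim=768, num_classes=20)
    logits, cams = head(torch.randn(2, 49, 768))
    loss = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
    print(logits.shape, cams.shape, loss.item())                   # (2, 20), (2, 20, 7, 7)
```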
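The multi-scale feature fusion in SWTformer-V2 can likewise be pictured, in spirit, as combining feature maps from several hierarchical Swin stages before CAM generation. The sketch below projects each stage to a shared width, upsamples everything to the finest grid, and sums; the channel widths follow the Swin-T defaults, but the fusion rule and module names are assumptions rather than the paper's exact mechanism, and the background-aware component is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuses feature maps from several hierarchical stages before CAM generation.

    Each stage is projected to a shared width with a 1x1 convolution, upsampled
    to the finest grid, and summed. Stage widths follow the Swin-T defaults.
    """
    def __init__(self, stage_dims=(96, 192, 384, 768), fused_dim=256):
        super().__init__()
        self.projs = nn.ModuleList(nn.Conv2d(d, fused_dim, kernel_size=1) for d in stage_dims)

    def forward(self, stage_maps):
        # stage_maps: list of (B, C_i, H_i, W_i) tensors, finest resolution first.
        target = stage_maps[0].shape[-2:]
        fused = 0
        for proj, feat in zip(self.projs, stage_maps):
            x = proj(feat)
            if x.shape[-2:] != target:
                x = F.interpolate(x, size=target, mode="bilinear", align_corners=False)
            fused = fused + x
        return fused                                               # (B, fused_dim, H_0, W_0)


if __name__ == "__main__":
    # Toy Swin-T style pyramid for a 224 input: 56x56, 28x28, 14x14, 7x7 grids.
    maps = [torch.randn(2, c, s, s) for c, s in zip((96, 192, 384, 768), (56, 28, 14, 7))]
    print(MultiScaleFusion()(maps).shape)                          # torch.Size([2, 256, 56, 56])
```

A fused map of this kind could then be fed to a token classifier such as the one sketched above to produce higher-resolution seed CAMs.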
Related papers
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z) - ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z) - Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z) - Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation [0.0]
This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM.
Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve 69.3% mIoU on the PascalVOC 2012 val set.
arXiv Detail & Related papers (2022-10-31T15:32:23Z) - Dual Progressive Transformations for Weakly Supervised Semantic
Segmentation [23.68115323096787]
Weakly supervised semantic segmentation (WSSS) is a challenging task in computer vision.
We propose a Convolutional Neural Networks Refined Transformer (CRT) to mine globally complete and locally accurate class activation maps.
Our proposed CRT achieves new state-of-the-art performance on the weakly supervised semantic segmentation task.
arXiv Detail & Related papers (2022-09-30T03:42:52Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Transformer-Guided Convolutional Neural Network for Cross-View
Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z) - Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z) - Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for the local, low-level and middle-level information.
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
arXiv Detail & Related papers (2021-07-06T01:48:43Z) - TransFG: A Transformer Architecture for Fine-grained Recognition [27.76159820385425]
Recently, the vision transformer (ViT) has shown strong performance on the traditional classification task.
We propose a novel transformer-based framework TransFG where we integrate all raw attention weights of the transformer into an attention map.
A contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes.
arXiv Detail & Related papers (2021-03-14T17:03:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.