TransCAM: Transformer Attention-based CAM Refinement for Weakly
Supervised Semantic Segmentation
- URL: http://arxiv.org/abs/2203.07239v1
- Date: Mon, 14 Mar 2022 16:17:18 GMT
- Title: TransCAM: Transformer Attention-based CAM Refinement for Weakly
Supervised Semantic Segmentation
- Authors: Ruiwen Li, Zheda Mai, Chiheb Trabelsi, Zhibo Zhang, Jongseong Jang,
Scott Sanner
- Abstract summary: We propose TransCAM, a Conformer-based solution to weakly supervised semantic segmentation.
We show that TransCAM achieves a new state-of-the-art performance of 69.3% and 69.6% on the respective PASCAL VOC 2012 validation and test sets.
- Score: 19.333543299407832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly supervised semantic segmentation (WSSS) with only image-level
supervision is a challenging task. Most existing methods exploit Class
Activation Maps (CAM) to generate pixel-level pseudo labels for supervised
training. However, due to the local receptive field of Convolutional Neural
Networks (CNN), CAM applied to CNNs often suffers from partial activation --
highlighting the most discriminative part instead of the entire object area. In
order to capture both local features and global representations, the Conformer
has been proposed to combine a visual transformer branch with a CNN branch. In
this paper, we propose TransCAM, a Conformer-based solution to WSSS that
explicitly leverages the attention weights from the transformer branch of the
Conformer to refine the CAM generated from the CNN branch. TransCAM is
motivated by our observation that attention weights from shallow transformer
blocks are able to capture low-level spatial feature similarities while
attention weights from deep transformer blocks capture high-level semantic
context. Despite its simplicity, TransCAM achieves a new state-of-the-art
performance of 69.3% and 69.6% on the respective PASCAL VOC 2012 validation and
test sets, showing the effectiveness of transformer attention-based refinement
of CAM for WSSS.
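The refinement idea described above can be sketched in a few lines: the flattened CAM is propagated through an attention matrix averaged over the transformer blocks, so activations spread from the most discriminative patch to spatially and semantically similar patches. The following is a minimal NumPy sketch of this general idea, not the authors' exact implementation; the function name, averaging scheme, and normalization are illustrative assumptions.

```python
import numpy as np

def refine_cam_with_attention(cam, attn_blocks):
    """Refine a coarse CAM with transformer attention maps.

    cam:         (C, N) class activation map, flattened over N patches
    attn_blocks: list of (N, N) patch-to-patch attention matrices,
                 one per transformer block
    """
    # Average attention across all blocks, combining low-level spatial
    # similarity (shallow blocks) with high-level semantic context
    # (deep blocks), as observed in the abstract.
    A = np.mean(np.stack(attn_blocks), axis=0)   # (N, N)

    # Propagate class activations: patch j's refined score is the
    # attention-weighted sum of activations over all patches k.
    refined = cam @ A.T                          # (C, N)

    # Normalize each class map to [0, 1].
    refined = refined - refined.min(axis=1, keepdims=True)
    refined = refined / (refined.max(axis=1, keepdims=True) + 1e-8)
    return refined
```

With an identity attention matrix the CAM is unchanged (up to normalization); with a uniform matrix the activation spreads evenly, which is the mechanism that counteracts partial activation.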
Related papers
- Semantic-Constraint Matching Transformer for Weakly Supervised Object
Localization [31.039698757869974]
Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision.
Previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope.
We propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation.
arXiv Detail & Related papers (2023-09-04T03:20:31Z)
- Feature Shrinkage Pyramid for Camouflaged Object Detection with
Transformers [34.42710399235461]
Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection.
They suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders.
We propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features.
arXiv Detail & Related papers (2023-03-26T20:50:58Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Attention-based Class Activation Diffusion for Weakly-Supervised
Semantic Segmentation [98.306533433627]
Extracting class activation maps (CAM) is a key step in weakly-supervised semantic segmentation (WSSS).
This paper proposes a new method that couples CAM and the attention matrix in a probabilistic diffusion manner, dubbed AD-CAM.
Experiments show that AD-CAM as pseudo labels can yield stronger WSSS models than the state-of-the-art variants of CAM.
arXiv Detail & Related papers (2022-11-20T10:06:32Z)
- Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation [0.0]
This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM.
Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), achieving 69.3% mIoU on the PASCAL VOC 2012 val set.
arXiv Detail & Related papers (2022-10-31T15:32:23Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained
Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor that has been explored extensively by convolutional neural network (CNN) based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
- TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised
Object Localization [112.46381729542658]
Weakly supervised object localization (WSOL) is a challenging problem when given image category labels.
We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in the visual transformer for long-range dependency extraction.
arXiv Detail & Related papers (2021-03-27T09:43:16Z)
- Self-supervised Equivariant Attention Mechanism for Weakly Supervised
Semantic Segmentation [93.83369981759996]
We propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap.
Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation.
We propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning.
arXiv Detail & Related papers (2020-04-09T14:57:57Z)
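The equivariance constraint behind SEAM can be illustrated with horizontal flipping as the transform: the CAM of a flipped image should equal the flipped CAM of the original image. The following is a minimal NumPy sketch of that consistency term under this assumption; the function name and the use of an L1 distance are illustrative, not SEAM's exact loss.

```python
import numpy as np

def equivariance_loss(cam_fn, image):
    """Consistency regularization between the CAMs of an image and its flip.

    cam_fn: maps an (H, W, 3) image to a (C, H, W) class activation map.
    Equivariance requires CAM(flip(x)) == flip(CAM(x)); the loss penalizes
    the gap between the two.
    """
    cam_original = cam_fn(image)              # (C, H, W)
    cam_flipped = cam_fn(image[:, ::-1, :])   # CAM of horizontally flipped image

    # L1 distance between CAM(flip(x)) and flip(CAM(x)).
    return np.abs(cam_flipped - cam_original[:, :, ::-1]).mean()
```

For a perfectly equivariant `cam_fn` the loss is zero; in training, minimizing this term supplies extra self-supervision beyond the image-level labels.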
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.