WeakTr: Exploring Plain Vision Transformer for Weakly-supervised
Semantic Segmentation
- URL: http://arxiv.org/abs/2304.01184v2
- Date: Thu, 27 Apr 2023 03:03:50 GMT
- Title: WeakTr: Exploring Plain Vision Transformer for Weakly-supervised
Semantic Segmentation
- Authors: Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu,
Xinggang Wang
- Abstract summary: This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS).
We name this plain Transformer-based Weakly-supervised learning framework WeakTr.
It achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014.
- Score: 32.16796174578446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores the properties of the plain Vision Transformer (ViT) for
Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM)
is of critical importance for understanding a classification network and
launching WSSS. We observe that different attention heads of ViT focus on
different image areas. Thus, a novel weight-based method is proposed to
estimate the importance of attention heads end-to-end, while the self-attention
maps are adaptively fused for high-quality CAM results that tend to have more
complete objects. Besides, we propose a ViT-based gradient clipping decoder for
online retraining with the CAM results to complete the WSSS task. We name this
plain Transformer-based Weakly-supervised learning framework WeakTr. It
achieves the state-of-the-art WSSS performance on standard benchmarks, i.e.,
78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of
COCO 2014. Code is available at https://github.com/hustvl/WeakTr.
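As a rough illustration of the head-importance idea, the PyTorch sketch below fuses per-head ViT attention maps with learnable weights and uses them to refine a coarse CAM. It is a minimal sketch under assumed shapes; the class and parameter names (`WeightedAttentionCAM`, `head_weights`) and the exact fusion steps are illustrative, not the released WeakTr implementation.

```python
import torch
import torch.nn as nn

class WeightedAttentionCAM(nn.Module):
    """Minimal sketch: fuse per-head ViT attention maps with learned
    head-importance weights and use them to refine a coarse CAM.
    Shapes and names are illustrative, not WeakTr's actual code."""

    def __init__(self, num_layers: int, num_heads: int):
        super().__init__()
        # one learnable importance weight per attention head (all layers flattened)
        self.head_weights = nn.Parameter(torch.ones(num_layers * num_heads))

    def forward(self, coarse_cam, cls2patch_attn, patch2patch_attn):
        # coarse_cam:       (B, C, N)        class activation per patch token
        # cls2patch_attn:   (B, L*H, N)      class-token -> patch attention per head
        # patch2patch_attn: (B, L*H, N, N)   patch -> patch attention per head
        w = torch.softmax(self.head_weights, dim=0)                    # normalized head importance
        cross = torch.einsum("h,bhn->bn", w, cls2patch_attn)           # fused class->patch map
        affinity = torch.einsum("h,bhmn->bmn", w, patch2patch_attn)    # fused patch affinity
        # weight the CAM by the fused cross-attention, then propagate
        # activations between similar patches via the fused affinity matrix
        refined = torch.einsum("bmn,bcn->bcm", affinity, coarse_cam * cross.unsqueeze(1))
        return refined  # (B, C, N); reshape to (B, C, H_p, W_p) outside
```

Because the head weights sit in the computation graph of the classification loss, they can be learned end-to-end, which is the behaviour the abstract describes; the precise CAM refinement in WeakTr may differ from this sketch.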
Related papers
- Attention Guided CAM: Visual Explanations of Vision Transformer Guided
by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides elaborate high-level semantic explanations with strong localization performance using only class labels.
arXiv Detail & Related papers (2024-02-07T03:43:56Z)
- Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves a 0.98% higher mAP in localization accuracy, outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation [0.0]
This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM.
Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve 69.3% mIoU on the PASCAL VOC 2012 val set.
arXiv Detail & Related papers (2022-10-31T15:32:23Z)
- SegViT: Semantic Segmentation with Plain Vision Transformers [91.50075506561598]
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks.
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone; an illustrative ATM-style sketch follows this list.
arXiv Detail & Related papers (2022-10-12T00:30:26Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- Patch-level Representation Learning for Self-supervised Vision
Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- TransCAM: Transformer Attention-based CAM Refinement for Weakly
Supervised Semantic Segmentation [19.333543299407832]
We propose TransCAM, a Conformer-based solution to weakly supervised semantic segmentation.
We show that TransCAM achieves a new state-of-the-art performance of 69.3% and 69.6% mIoU on the respective PASCAL VOC 2012 validation and test sets.
arXiv Detail & Related papers (2022-03-14T16:17:18Z)
- Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for the local, low-level and middle-level information.
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens; a MAWS-style selection sketch follows this list.
arXiv Detail & Related papers (2021-07-06T01:48:43Z)
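For the SegViT entry above, the Attention-to-Mask mechanism (similarity between learnable class tokens and spatial features read out as masks) can be pictured with the sketch below. It is illustrative only: `ATMSketch`, `to_q`, and `to_k` are assumed names, and the paper wraps this mechanism in transformer decoder layers that the sketch omits.

```python
import torch
import torch.nn as nn

class ATMSketch(nn.Module):
    """Illustrative Attention-to-Mask style head: learnable class tokens attend
    to spatial patch features, and the similarity maps are read out directly as
    per-class segmentation masks. Not the SegViT reference implementation."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.to_q = nn.Linear(dim, dim)  # projects class tokens to queries
        self.to_k = nn.Linear(dim, dim)  # projects patch features to keys

    def forward(self, feats):
        # feats: (B, N, dim) patch features from a plain ViT backbone
        B, N, D = feats.shape
        q = self.to_q(self.class_tokens).expand(B, -1, -1)      # (B, C, D)
        k = self.to_k(feats)                                    # (B, N, D)
        sim = torch.einsum("bcd,bnd->bcn", q, k) / D ** 0.5     # class-to-patch similarity
        return torch.sigmoid(sim)  # (B, C, N) masks; reshape to (B, C, H_p, W_p) for output
```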
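For the FFVT entry, assuming "mutual attention" means combining the class token's attention to each patch token with that token's attention back to the class token, a hypothetical selection helper could look like the following; names and shapes are assumptions, not the FFVT reference code.

```python
import torch

def maws_select(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Hypothetical MAWS-style selection: score each patch token by the product
    of the attention the class token pays to it and the attention it pays back
    to the class token, then keep the top-k indices per image.

    attn: (B, H, N+1, N+1) attention matrix of one layer; token 0 is the class token.
    Returns: (B, k) indices of selected patch tokens (excluding the class token).
    """
    attn = attn.mean(dim=1)               # average over heads -> (B, N+1, N+1)
    cls_to_tok = attn[:, 0, 1:]           # how much the class token attends to each patch
    tok_to_cls = attn[:, 1:, 0]           # how much each patch attends to the class token
    mutual = cls_to_tok * tok_to_cls      # "mutual attention" score per patch token
    return mutual.topk(k, dim=-1).indices # (B, k)
```

The tokens selected this way from each layer would then be aggregated and passed on, matching the fusion step the entry describes.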