Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation
- URL: http://arxiv.org/abs/2210.17400v1
- Date: Mon, 31 Oct 2022 15:32:23 GMT
- Title: Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation
- Authors: Simone Rossetti (1 and 2), Damiano Zappia (1), Marta Sanzari (2),
Marco Schaerf (1 and 2), Fiora Pirri (1 and 2) ((1) DeepPlants, (2) DIAG
Sapienza)
- Abstract summary: This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM.
Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU on the PascalVOC 2012 $val$ set.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly Supervised Semantic Segmentation (WSSS) research has explored many
directions to improve the typical pipeline of a CNN plus class activation maps (CAM)
plus refinements, given the image-class label as the only supervision. Though
the gap with fully supervised methods has been reduced, closing it further
seems unlikely within this framework. On the other hand, WSSS methods
based on Vision Transformers (ViT) have not yet explored valid alternatives to
CAM. ViT features have been shown to retain scene layout and object
boundaries in self-supervised learning. To confirm these findings, we prove
that the advantages of transformers in self-supervised methods are further
strengthened by Global Max Pooling (GMP), which can leverage patch features to
negotiate pixel-label probability with class probability. This work proposes a
new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM. The
proposed end-to-end network learns, with a single optimization process, refined
shapes and proper localization for segmentation masks. Our model outperforms the
state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU
on the PascalVOC 2012 $val$ set. We show that our approach has the fewest
parameters while obtaining higher accuracy than all other approaches. In a
sentence, quantitative and qualitative results of our method reveal that
ViT-PCM is an excellent alternative to CNN-CAM based architectures.
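To make the patch-class mapping plus Global Max Pooling (GMP) idea concrete, the sketch below maps ViT patch embeddings to per-patch class probabilities and reduces them with GMP to image-level predictions, so that image-level labels alone can supervise patch-level scores. This is a minimal PyTorch sketch under our own assumptions (the module name PatchClassMapper, the sigmoid head, and the BCE loss are illustrative choices), not the authors' released ViT-PCM implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchClassMapper(nn.Module):
    """Minimal sketch: map ViT patch embeddings to per-patch class probabilities,
    then apply Global Max Pooling (GMP) over patches so that image-level labels
    can supervise the patch-level scores."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 20):
        super().__init__()
        self.patch_to_class = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, D) patch embeddings from a ViT encoder (CLS token excluded).
        patch_probs = torch.sigmoid(self.patch_to_class(patch_tokens))  # (B, N, C)
        image_probs = patch_probs.max(dim=1).values                     # GMP over patches -> (B, C)
        return patch_probs, image_probs

# Hypothetical usage with image-level multi-label supervision only:
model = PatchClassMapper(embed_dim=768, num_classes=20)
tokens = torch.randn(2, 196, 768)          # e.g. 14x14 patches from a ViT-B/16
labels = torch.zeros(2, 20)
labels[0, 3] = 1.0
labels[1, 7] = 1.0
patch_probs, image_probs = model(tokens)
loss = F.binary_cross_entropy(image_probs, labels)
print(patch_probs.shape, image_probs.shape, loss.item())
```
After training, the per-patch probabilities can be reshaped to the patch grid, upsampled to the image resolution, and thresholded to obtain baseline pseudo-masks.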
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
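Distributing features evenly across a limited set of clusters is the kind of balanced-assignment problem usually solved with a few Sinkhorn-Knopp iterations. The snippet below is a generic Sinkhorn normalization of a feature-to-prototype score matrix (in the style popularized by SwAV), shown only as background; it is not claimed to be SIGMA's exact procedure.
```python
import torch

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Generic Sinkhorn-Knopp normalization: turn feature-to-cluster scores (B x K)
    into soft assignments whose mass is spread evenly over the K clusters."""
    q = torch.exp(scores / eps)                  # (B, K)
    q = q / q.sum()                              # normalize total mass to 1
    B, K = q.shape
    for _ in range(iters):
        q = q / q.sum(dim=0, keepdim=True) / K   # each cluster receives 1/K of the mass
        q = q / q.sum(dim=1, keepdim=True) / B   # each sample keeps 1/B of the mass
    return q * B                                 # rows sum to 1: soft cluster assignments

# Hypothetical usage: scores between L2-normalized tube features and cluster prototypes.
feats = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
protos = torch.nn.functional.normalize(torch.randn(16, 128), dim=1)
assignments = sinkhorn(feats @ protos.T)
print(assignments.sum(dim=1))  # ~1 per sample; mass is roughly balanced across clusters
```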
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides elaborate high-level semantic explanations with strong localization performance using only class labels.
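The paper's exact procedure is not described in this blurb; as generic background on attention-based localization for ViTs, the sketch below implements attention rollout (Abnar & Zuidema), which multiplies layer-wise attention maps to estimate how strongly each patch feeds the CLS token. Treat it as a related baseline, not this paper's method.
```python
import torch

def attention_rollout(attentions):
    """Attention rollout: multiply head-averaged attention maps across layers,
    adding the residual connection, to trace CLS-to-patch relevance.
    Shown only as a generic attention-based localization baseline."""
    result = None
    for attn in attentions:                       # each attn: (heads, N, N), N = 1 + num_patches
        a = attn.mean(dim=0)                      # average over heads
        a = a + torch.eye(a.size(0))              # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalize rows
        result = a if result is None else a @ result
    return result[0, 1:]                          # CLS-to-patch relevance, shape (num_patches,)

# Hypothetical usage with random attention maps from a 12-layer, 12-head ViT on 14x14 patches.
layers = [torch.rand(12, 197, 197).softmax(dim=-1) for _ in range(12)]
relevance = attention_rollout(layers)
print(relevance.reshape(14, 14).shape)
```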
arXiv Detail & Related papers (2024-02-07T03:43:56Z)
- Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves a 0.98% mAP higher localization accuracy, outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
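For intuition, "Taylorizing" a nonlinearity means replacing it with a low-degree polynomial so that secure multi-party protocols only need additions and multiplications. The sketch below swaps selected GELU modules for their second-order Taylor expansion around zero; which layers PriViT actually replaces, and how the selection is learned, is not reproduced here (PolyGELU and taylorize are illustrative names).
```python
import torch
import torch.nn as nn

class PolyGELU(nn.Module):
    """Toy 'Taylorized' activation: second-order Taylor expansion of
    GELU(x) = x * Phi(x) around x = 0, i.e. 0.5*x + 0.3989*x**2,
    which needs only additions and multiplications."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * x + 0.3989 * x ** 2

def taylorize(model: nn.Module, layers_to_replace: set) -> nn.Module:
    """Selectively replace named nn.GELU submodules with the polynomial surrogate."""
    for name, module in model.named_modules():
        for child_name, child in module.named_children():
            full_name = f"{name}.{child_name}" if name else child_name
            if isinstance(child, nn.GELU) and full_name in layers_to_replace:
                setattr(module, child_name, PolyGELU())
    return model

# Hypothetical usage on a tiny feed-forward block resembling a ViT MLP:
block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = taylorize(block, layers_to_replace={"1"})
print(block)
```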
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation [32.16796174578446]
This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS).
We name this plain Transformer-based Weakly-supervised learning framework WeakTr.
It achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014.
arXiv Detail & Related papers (2023-04-03T17:54:10Z)
- Attention-based Class Activation Diffusion for Weakly-Supervised Semantic Segmentation [98.306533433627]
Extracting class activation maps (CAM) is a key step for weakly-supervised semantic segmentation (WSSS).
This paper proposes a new method to couple the CAM and the attention matrix in a probabilistic diffusion way, dubbed AD-CAM.
Experiments show that AD-CAM as pseudo labels can yield stronger WSSS models than the state-of-the-art variants of CAM.
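One way to picture coupling a CAM with the attention matrix is as a random walk: row-normalized patch-to-patch attention acts as a transition matrix that diffuses coarse class activations toward attended patches. The sketch below implements that generic reading, not AD-CAM's exact probabilistic formulation; the number of steps and the mixing weight alpha are assumptions.
```python
import torch

def diffuse_cam(cam: torch.Tensor, attn: torch.Tensor,
                steps: int = 2, alpha: float = 0.5) -> torch.Tensor:
    """Generic attention diffusion of a CAM: treat row-normalized patch-to-patch
    attention as a transition matrix and repeatedly propagate class scores.

    cam:  (N, C) per-patch class activations
    attn: (N, N) patch-to-patch attention (e.g. averaged over heads/layers)
    """
    trans = attn / attn.sum(dim=-1, keepdim=True)        # row-stochastic transition matrix
    out = cam
    for _ in range(steps):
        out = alpha * (trans @ out) + (1 - alpha) * cam  # mix diffused and original scores
    return out

# Hypothetical usage on a 14x14 patch grid with 20 foreground classes.
refined = diffuse_cam(torch.rand(196, 20), torch.rand(196, 196))
print(refined.shape)
```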
arXiv Detail & Related papers (2022-11-20T10:06:32Z)
- Weakly Supervised Semantic Segmentation via Progressive Patch Learning [39.87150496277798]
"Progressive Patch Learning" approach is proposed to improve the local details extraction of the classification.
"Patch Learning" destructs the feature maps into patches and independently processes each local patch in parallel before the final aggregation.
"Progressive Patch Learning" further extends the feature destruction and patch learning to multi-level granularities in a progressive manner.
arXiv Detail & Related papers (2022-09-16T09:54:17Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
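A toy version of refining predictions with color affinities: propagate per-pixel class probabilities between neighbouring pixels, weighting each neighbour by color similarity. This is only a crude stand-in for the paper's class-and-color affinity refinement; the kernel, the 4-connected neighbourhood, and the number of steps are assumptions.
```python
import torch

def color_affinity_refine(probs: torch.Tensor, image: torch.Tensor,
                          sigma: float = 0.1, steps: int = 3) -> torch.Tensor:
    """Toy online refinement: average class probabilities over 4-connected neighbours,
    weighting each neighbour by colour similarity.

    probs: (C, H, W) per-pixel class probabilities; image: (3, H, W) in [0, 1].
    Note: torch.roll wraps around at the borders; acceptable for a toy sketch."""
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for _ in range(steps):
        acc, wsum = probs.clone(), torch.ones_like(probs[:1])
        for dy, dx in shifts:
            shifted_img = torch.roll(image, shifts=(dy, dx), dims=(1, 2))
            shifted_p = torch.roll(probs, shifts=(dy, dx), dims=(1, 2))
            w = torch.exp(-((image - shifted_img) ** 2).sum(0, keepdim=True) / (2 * sigma ** 2))
            acc = acc + w * shifted_p
            wsum = wsum + w
        probs = acc / wsum
    return probs

# Hypothetical usage with 21 PascalVOC classes (including background):
refined = color_affinity_refine(torch.rand(21, 64, 64).softmax(0), torch.rand(3, 64, 64))
print(refined.shape)
```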
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
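The local half of a local-global scheme can be illustrated by partitioning the token grid into non-overlapping windows and attending only within each window, with a second attention over per-window summaries supplying the cross-window, global view. The sketch below is that generic pattern (the window size, mean pooling, and reuse of one attention module are simplifications), not the HLG architecture itself.
```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) token grid into non-overlapping ws x ws windows:
    returns (B * num_windows, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# Hypothetical local-attention step: self-attention restricted to each window.
B, H, W, C, ws = 2, 14, 14, 96, 7
tokens = torch.randn(B, H, W, C)
windows = window_partition(tokens, ws)              # (B*4, 49, 96)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
local_out, _ = attn(windows, windows, windows)      # attention stays inside each window
print(local_out.shape)

# A global step could then attend across per-window summaries, e.g.:
summaries = local_out.mean(dim=1).view(B, -1, C)    # one pooled token per window
global_out, _ = attn(summaries, summaries, summaries)
print(global_out.shape)
```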
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
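Dropping masked patches leaves each window with a different number of visible tokens, so windows are grouped before attention to avoid padding waste. The paper optimizes the grouping with dynamic programming; the sketch below only illustrates the idea with a first-fit-decreasing heuristic (group_windows and group_capacity are made-up names).
```python
def group_windows(visible_counts, group_capacity):
    """Greedy sketch: pack windows (by their number of visible patches after masking)
    into groups whose total stays under a capacity, so attention can run on dense
    groups instead of padded windows. First-fit decreasing stands in for the paper's
    dynamic-programming grouping."""
    order = sorted(range(len(visible_counts)), key=lambda i: -visible_counts[i])
    groups, loads = [], []
    for idx in order:
        for g, load in enumerate(loads):
            if load + visible_counts[idx] <= group_capacity:
                groups[g].append(idx)
                loads[g] += visible_counts[idx]
                break
        else:
            groups.append([idx])
            loads.append(visible_counts[idx])
    return groups

# Hypothetical usage: 9 windows with varying numbers of unmasked patches.
print(group_windows([12, 3, 7, 9, 1, 14, 6, 2, 10], group_capacity=16))
```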
arXiv Detail & Related papers (2022-05-26T17:34:42Z)
- Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation [93.83369981759996]
We propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap.
Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation.
We propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning.
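The equivariance constraint can be read as CAM(A(x)) ≈ A(CAM(x)) for a spatial transform A. The sketch below penalizes the L1 discrepancy between the CAM of a downscaled image and the downscaled CAM of the original; the choice of transform, the L1 loss, and the toy CAM head are assumptions, not SEAM's full formulation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def equivariance_consistency_loss(cam_fn, image: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Generic equivariant-consistency regularizer: the CAM of a rescaled image
    should match the rescaled CAM of the original image (L1 penalty assumed)."""
    cam_orig = cam_fn(image)                                              # CAM(x): (B, C, H, W)
    small = F.interpolate(image, scale_factor=scale, mode="bilinear", align_corners=False)
    cam_small = cam_fn(small)                                             # CAM(A(x))
    cam_orig_small = F.interpolate(cam_orig, size=cam_small.shape[-2:],
                                   mode="bilinear", align_corners=False)  # A(CAM(x))
    return (cam_small - cam_orig_small).abs().mean()

# Hypothetical usage with a toy 1x1-conv CAM head standing in for a real classification network.
toy_cam = nn.Conv2d(3, 20, kernel_size=1)
loss = equivariance_consistency_loss(toy_cam, torch.rand(2, 3, 64, 64))
print(loss.item())
```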
arXiv Detail & Related papers (2020-04-09T14:57:57Z)