Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation
- URL: http://arxiv.org/abs/2210.17400v1
- Date: Mon, 31 Oct 2022 15:32:23 GMT
- Title: Max Pooling with Vision Transformers reconciles class and shape in
weakly supervised semantic segmentation
- Authors: Simone Rossetti (1 and 2), Damiano Zappia (1), Marta Sanzari (2),
Marco Schaerf (1 and 2), Fiora Pirri (1 and 2) ((1) DeepPlants, (2) DIAG
Sapienza)
- Abstract summary: This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM.
Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU on the PascalVOC 2012 $val$ set.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly Supervised Semantic Segmentation (WSSS) research has explored many
directions to improve the typical pipeline of a CNN plus class activation maps (CAM)
plus refinements, given the image-class label as the only supervision. Though
the gap with fully supervised methods has been reduced, closing it further
seems unlikely within this framework. On the other hand, WSSS methods
based on Vision Transformers (ViT) have not yet explored valid alternatives to
CAM. ViT features have been shown to retain scene layout and object
boundaries in self-supervised learning. To confirm these findings, we prove
that the advantages of transformers in self-supervised methods are further
strengthened by Global Max Pooling (GMP), which can leverage patch features to
negotiate pixel-label probability with class probability. This work proposes a
new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM. The
proposed end-to-end network learns, with a single optimization process, refined
shapes and proper localization for segmentation masks. Our model outperforms the
state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU
on the PascalVOC 2012 $val$ set. We show that our approach has the fewest
parameters while obtaining higher accuracy than all other approaches. In a
sentence, quantitative and qualitative results of our method reveal that
ViT-PCM is an excellent alternative to CNN-CAM based architectures.
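To make the patch-class mapping plus Global Max Pooling (GMP) idea concrete, the sketch below maps ViT patch embeddings to per-patch class probabilities and reduces them with GMP to image-level predictions, so that image-level labels alone can supervise patch-level scores. This is a minimal PyTorch sketch under our own assumptions (the module name PatchClassMapper, the sigmoid head, and the BCE loss are illustrative choices), not the authors' released ViT-PCM implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchClassMapper(nn.Module):
    """Minimal sketch: map ViT patch embeddings to per-patch class probabilities,
    then apply Global Max Pooling (GMP) over patches so that image-level labels
    can supervise the patch-level scores."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 20):
        super().__init__()
        self.patch_to_class = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, D) patch embeddings from a ViT encoder (CLS token excluded).
        patch_probs = torch.sigmoid(self.patch_to_class(patch_tokens))  # (B, N, C)
        image_probs = patch_probs.max(dim=1).values                     # GMP over patches -> (B, C)
        return patch_probs, image_probs

# Hypothetical usage with image-level multi-label supervision only:
model = PatchClassMapper(embed_dim=768, num_classes=20)
tokens = torch.randn(2, 196, 768)          # e.g. 14x14 patches from a ViT-B/16
labels = torch.zeros(2, 20)
labels[0, 3] = 1.0
labels[1, 7] = 1.0
patch_probs, image_probs = model(tokens)
loss = F.binary_cross_entropy(image_probs, labels)
print(patch_probs.shape, image_probs.shape, loss.item())
```
After training, the per-patch probabilities can be reshaped to the patch grid, upsampled to the image resolution, and thresholded to obtain baseline pseudo-masks.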
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
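Distributing features evenly across a limited set of clusters is the kind of balanced-assignment problem usually solved with a few Sinkhorn-Knopp iterations. The snippet below is a generic Sinkhorn normalization of a feature-to-prototype score matrix (in the style popularized by SwAV), shown only as background; it is not claimed to be SIGMA's exact procedure.
```python
import torch

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Generic Sinkhorn-Knopp normalization: turn feature-to-cluster scores (B x K)
    into soft assignments whose mass is spread evenly over the K clusters."""
    q = torch.exp(scores / eps)                  # (B, K)
    q = q / q.sum()                              # normalize total mass to 1
    B, K = q.shape
    for _ in range(iters):
        q = q / q.sum(dim=0, keepdim=True) / K   # each cluster receives 1/K of the mass
        q = q / q.sum(dim=1, keepdim=True) / B   # each sample keeps 1/B of the mass
    return q * B                                 # rows sum to 1: soft cluster assignments

# Hypothetical usage: scores between L2-normalized tube features and cluster prototypes.
feats = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
protos = torch.nn.functional.normalize(torch.randn(16, 128), dim=1)
assignments = sinkhorn(feats @ protos.T)
print(assignments.sum(dim=1))  # ~1 per sample; mass is roughly balanced across clusters
```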
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides elaborate high-level semantic explanations with strong localization performance using only class labels.
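The paper's exact procedure is not described in this blurb; as generic background on attention-based localization for ViTs, the sketch below implements attention rollout (Abnar & Zuidema), which multiplies layer-wise attention maps to estimate how strongly each patch feeds the CLS token. Treat it as a related baseline, not this paper's method.
```python
import torch

def attention_rollout(attentions):
    """Attention rollout: multiply head-averaged attention maps across layers,
    adding the residual connection, to trace CLS-to-patch relevance.
    Shown only as a generic attention-based localization baseline."""
    result = None
    for attn in attentions:                       # each attn: (heads, N, N), N = 1 + num_patches
        a = attn.mean(dim=0)                      # average over heads
        a = a + torch.eye(a.size(0))              # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalize rows
        result = a if result is None else a @ result
    return result[0, 1:]                          # CLS-to-patch relevance, shape (num_patches,)

# Hypothetical usage with random attention maps from a 12-layer, 12-head ViT on 14x14 patches.
layers = [torch.rand(12, 197, 197).softmax(dim=-1) for _ in range(12)]
relevance = attention_rollout(layers)
print(relevance.reshape(14, 14).shape)
```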
arXiv Detail & Related papers (2024-02-07T03:43:56Z)
- Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves a 0.98% mAP higher localization accuracy, outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
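For intuition, "Taylorizing" a nonlinearity means replacing it with a low-degree polynomial so that secure multi-party protocols only need additions and multiplications. The sketch below swaps selected GELU modules for their second-order Taylor expansion around zero; which layers PriViT actually replaces, and how the selection is learned, is not reproduced here (PolyGELU and taylorize are illustrative names).
```python
import torch
import torch.nn as nn

class PolyGELU(nn.Module):
    """Toy 'Taylorized' activation: second-order Taylor expansion of
    GELU(x) = x * Phi(x) around x = 0, i.e. 0.5*x + 0.3989*x**2,
    which needs only additions and multiplications."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * x + 0.3989 * x ** 2

def taylorize(model: nn.Module, layers_to_replace: set) -> nn.Module:
    """Selectively replace named nn.GELU submodules with the polynomial surrogate."""
    for name, module in model.named_modules():
        for child_name, child in module.named_children():
            full_name = f"{name}.{child_name}" if name else child_name
            if isinstance(child, nn.GELU) and full_name in layers_to_replace:
                setattr(module, child_name, PolyGELU())
    return model

# Hypothetical usage on a tiny feed-forward block resembling a ViT MLP:
block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = taylorize(block, layers_to_replace={"1"})
print(block)
```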
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation [32.16796174578446]
This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS).
We name this plain Transformer-based Weakly-supervised learning framework WeakTr.
It achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014.
arXiv Detail & Related papers (2023-04-03T17:54:10Z)
- Attention-based Class Activation Diffusion for Weakly-Supervised Semantic Segmentation [98.306533433627]
Extracting class activation maps (CAM) is a key step for weakly-supervised semantic segmentation (WSSS).
This paper proposes a new method to couple the CAM and the attention matrix in a probabilistic diffusion way, dubbed AD-CAM.
Experiments show that AD-CAM as pseudo labels can yield stronger WSSS models than the state-of-the-art variants of CAM.
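One way to picture coupling a CAM with the attention matrix is as a random walk: row-normalized patch-to-patch attention acts as a transition matrix that diffuses coarse class activations toward attended patches. The sketch below implements that generic reading, not AD-CAM's exact probabilistic formulation; the number of steps and the mixing weight alpha are assumptions.
```python
import torch

def diffuse_cam(cam: torch.Tensor, attn: torch.Tensor,
                steps: int = 2, alpha: float = 0.5) -> torch.Tensor:
    """Generic attention diffusion of a CAM: treat row-normalized patch-to-patch
    attention as a transition matrix and repeatedly propagate class scores.

    cam:  (N, C) per-patch class activations
    attn: (N, N) patch-to-patch attention (e.g. averaged over heads/layers)
    """
    trans = attn / attn.sum(dim=-1, keepdim=True)        # row-stochastic transition matrix
    out = cam
    for _ in range(steps):
        out = alpha * (trans @ out) + (1 - alpha) * cam  # mix diffused and original scores
    return out

# Hypothetical usage on a 14x14 patch grid with 20 foreground classes.
refined = diffuse_cam(torch.rand(196, 20), torch.rand(196, 196))
print(refined.shape)
```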
arXiv Detail & Related papers (2022-11-20T10:06:32Z)
- Weakly Supervised Semantic Segmentation via Progressive Patch Learning [39.87150496277798]
"Progressive Patch Learning" approach is proposed to improve the local details extraction of the classification.
"Patch Learning" destructs the feature maps into patches and independently processes each local patch in parallel before the final aggregation.
"Progressive Patch Learning" further extends the feature destruction and patch learning to multi-level granularities in a progressive manner.
arXiv Detail & Related papers (2022-09-16T09:54:17Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
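A toy version of refining predictions with color affinities: propagate per-pixel class probabilities between neighbouring pixels, weighting each neighbour by color similarity. This is only a crude stand-in for the paper's class-and-color affinity refinement; the kernel, the 4-connected neighbourhood, and the number of steps are assumptions.
```python
import torch

def color_affinity_refine(probs: torch.Tensor, image: torch.Tensor,
                          sigma: float = 0.1, steps: int = 3) -> torch.Tensor:
    """Toy online refinement: average class probabilities over 4-connected neighbours,
    weighting each neighbour by colour similarity.

    probs: (C, H, W) per-pixel class probabilities; image: (3, H, W) in [0, 1].
    Note: torch.roll wraps around at the borders; acceptable for a toy sketch."""
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for _ in range(steps):
        acc, wsum = probs.clone(), torch.ones_like(probs[:1])
        for dy, dx in shifts:
            shifted_img = torch.roll(image, shifts=(dy, dx), dims=(1, 2))
            shifted_p = torch.roll(probs, shifts=(dy, dx), dims=(1, 2))
            w = torch.exp(-((image - shifted_img) ** 2).sum(0, keepdim=True) / (2 * sigma ** 2))
            acc = acc + w * shifted_p
            wsum = wsum + w
        probs = acc / wsum
    return probs

# Hypothetical usage with 21 PascalVOC classes (including background):
refined = color_affinity_refine(torch.rand(21, 64, 64).softmax(0), torch.rand(3, 64, 64))
print(refined.shape)
```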
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
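The local half of a local-global scheme can be illustrated by partitioning the token grid into non-overlapping windows and attending only within each window, with a second attention over per-window summaries supplying the cross-window, global view. The sketch below is that generic pattern (the window size, mean pooling, and reuse of one attention module are simplifications), not the HLG architecture itself.
```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) token grid into non-overlapping ws x ws windows:
    returns (B * num_windows, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# Hypothetical local-attention step: self-attention restricted to each window.
B, H, W, C, ws = 2, 14, 14, 96, 7
tokens = torch.randn(B, H, W, C)
windows = window_partition(tokens, ws)              # (B*4, 49, 96)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
local_out, _ = attn(windows, windows, windows)      # attention stays inside each window
print(local_out.shape)

# A global step could then attend across per-window summaries, e.g.:
summaries = local_out.mean(dim=1).view(B, -1, C)    # one pooled token per window
global_out, _ = attn(summaries, summaries, summaries)
print(global_out.shape)
```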
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
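Dropping masked patches leaves each window with a different number of visible tokens, so windows are grouped before attention to avoid padding waste. The paper optimizes the grouping with dynamic programming; the sketch below only illustrates the idea with a first-fit-decreasing heuristic (group_windows and group_capacity are made-up names).
```python
def group_windows(visible_counts, group_capacity):
    """Greedy sketch: pack windows (by their number of visible patches after masking)
    into groups whose total stays under a capacity, so attention can run on dense
    groups instead of padded windows. First-fit decreasing stands in for the paper's
    dynamic-programming grouping."""
    order = sorted(range(len(visible_counts)), key=lambda i: -visible_counts[i])
    groups, loads = [], []
    for idx in order:
        for g, load in enumerate(loads):
            if load + visible_counts[idx] <= group_capacity:
                groups[g].append(idx)
                loads[g] += visible_counts[idx]
                break
        else:
            groups.append([idx])
            loads.append(visible_counts[idx])
    return groups

# Hypothetical usage: 9 windows with varying numbers of unmasked patches.
print(group_windows([12, 3, 7, 9, 1, 14, 6, 2, 10], group_capacity=16))
```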
arXiv Detail & Related papers (2022-05-26T17:34:42Z)
- Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation [93.83369981759996]
We propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap.
Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation.
We propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning.
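The equivariance constraint can be read as CAM(A(x)) ≈ A(CAM(x)) for a spatial transform A. The sketch below penalizes the L1 discrepancy between the CAM of a downscaled image and the downscaled CAM of the original; the choice of transform, the L1 loss, and the toy CAM head are assumptions, not SEAM's full formulation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def equivariance_consistency_loss(cam_fn, image: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Generic equivariant-consistency regularizer: the CAM of a rescaled image
    should match the rescaled CAM of the original image (L1 penalty assumed)."""
    cam_orig = cam_fn(image)                                              # CAM(x): (B, C, H, W)
    small = F.interpolate(image, scale_factor=scale, mode="bilinear", align_corners=False)
    cam_small = cam_fn(small)                                             # CAM(A(x))
    cam_orig_small = F.interpolate(cam_orig, size=cam_small.shape[-2:],
                                   mode="bilinear", align_corners=False)  # A(CAM(x))
    return (cam_small - cam_orig_small).abs().mean()

# Hypothetical usage with a toy 1x1-conv CAM head standing in for a real classification network.
toy_cam = nn.Conv2d(3, 20, kernel_size=1)
loss = equivariance_consistency_loss(toy_cam, torch.rand(2, 3, 64, 64))
print(loss.item())
```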
arXiv Detail & Related papers (2020-04-09T14:57:57Z)