Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
- URL: http://arxiv.org/abs/2203.02891v1
- Date: Sun, 6 Mar 2022 07:18:23 GMT
- Title: Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
- Authors: Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, Dan Xu
- Abstract summary: We propose a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS).
Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate if the transformer model can also effectively capture class-specific attention for more discriminative object localization.
The proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets.
- Score: 94.78965643354285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new transformer-based framework to learn class-specific
object localization maps as pseudo labels for weakly supervised semantic
segmentation (WSSS). Inspired by the fact that the attended regions of the
one-class token in the standard vision transformer can be leveraged to form a
class-agnostic localization map, we investigate if the transformer model can
also effectively capture class-specific attention for more discriminative
object localization by learning multiple class tokens within the transformer.
To this end, we propose a Multi-class Token Transformer, termed as MCTformer,
which uses multiple class tokens to learn interactions between the class tokens
and the patch tokens. The proposed MCTformer can successfully produce
class-discriminative object localization maps from class-to-patch attentions
corresponding to different class tokens. We also propose to use a patch-level
pairwise affinity, which is extracted from the patch-to-patch transformer
attention, to further refine the localization maps. Moreover, the proposed
framework is shown to fully complement the Class Activation Mapping (CAM)
method, leading to remarkably superior WSSS results on the PASCAL VOC and MS
COCO datasets. These results underline the importance of the class token for
WSSS.
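The abstract describes two mechanisms: reading class-specific localization maps off the class-to-patch attention of multiple class tokens, and refining those maps with a patch-level pairwise affinity taken from patch-to-patch attention. The sketch below illustrates that pipeline with numpy; the token layout (C class tokens followed by N = h*w patch tokens), the row-normalization, and all shapes are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def class_localization_maps(attn, num_classes, h, w, refine=True):
    """Sketch of MCTformer-style map extraction from a joint attention
    matrix of shape (C+N, C+N), where C class tokens precede N = h*w
    patch tokens. Shapes and normalization are illustrative assumptions."""
    C, N = num_classes, h * w
    assert attn.shape == (C + N, C + N)
    # Class-to-patch attention: each class token's attention over patches.
    cls2patch = attn[:C, C:]                      # (C, N)
    if refine:
        # Patch-to-patch attention acts as a pairwise affinity that
        # propagates activations between mutually similar patches.
        affinity = attn[C:, C:]                   # (N, N)
        # Row-normalize so refinement is an affinity-weighted average.
        affinity = affinity / (affinity.sum(axis=1, keepdims=True) + 1e-8)
        cls2patch = cls2patch @ affinity.T        # (C, N)
    return cls2patch.reshape(C, h, w)             # one map per class

# Toy example: 3 classes, a 4x4 patch grid, random "attention" weights.
rng = np.random.default_rng(0)
A = rng.random((3 + 16, 3 + 16))
maps = class_localization_maps(A, num_classes=3, h=4, w=4)
print(maps.shape)  # (3, 4, 4)
```

In a real model the attention would be averaged over heads (and possibly layers) before this extraction step.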
Related papers
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised
Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
- All-pairs Consistency Learning for Weakly Supervised Semantic Segmentation [42.66269050864235]
We propose a new transformer-based regularization to better localize objects for weakly supervised semantic segmentation (WSSS).
We adopt vision transformers, as their self-attention mechanism naturally embeds pairwise affinity.
Our method produces noticeably better class localization maps (67.3% mIoU on the PASCAL VOC train set).
arXiv Detail & Related papers (2023-08-08T15:14:23Z)
- MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [90.73815426893034]
We propose a transformer-based framework that aims to enhance weakly supervised semantic segmentation.
We introduce a Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens.
A Contrastive-Class-Token (CCT) module is proposed to enhance the learning of discriminative class tokens.
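The abstract names a Contrastive-Class-Token (CCT) module for learning discriminative class tokens but does not define it here. A minimal sketch of one plausible contrastive objective follows, in which each class token should be most similar to itself and dissimilar to the other class tokens; this exact formulation (cosine similarities with identity targets) is an assumption, not the paper's definition.

```python
import numpy as np

def contrastive_class_token_loss(tokens, temperature=0.1):
    """Hypothetical contrastive objective over a (C, D) matrix of class
    tokens: cross-entropy on a temperature-scaled cosine-similarity
    matrix with identity targets, pushing tokens apart from each other."""
    # Cosine-normalize each token embedding.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    logits = t @ t.T / temperature                    # (C, C) similarities
    # Numerically stable softmax cross-entropy against identity targets.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs) + 1e-12))

# Toy example: 5 class tokens of dimension 64.
rng = np.random.default_rng(1)
loss = contrastive_class_token_loss(rng.standard_normal((5, 64)))
print(loss >= 0.0)  # True: negative log-likelihood is non-negative
```

Minimizing this loss drives the off-diagonal similarities down, i.e., it makes the class tokens mutually discriminative.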
arXiv Detail & Related papers (2023-08-06T03:30:20Z)
- MOST: Multiple Object localization with Self-supervised Transformers for object discovery [97.47075050779085]
We present Multiple Object localization with Self-supervised Transformers (MOST).
MOST uses features from transformers trained with self-supervised learning to localize multiple objects in real-world images.
We show that MOST can be used for self-supervised pre-training of object detectors, and that it yields consistent improvements on fully- and semi-supervised object detection and unsupervised region proposal generation.
arXiv Detail & Related papers (2023-04-11T17:57:27Z)
- Distinguishability Calibration to In-Context Learning [31.375797763897104]
We propose a method to map a PLM-encoded embedding into a new metric space that guarantees the distinguishability of the resulting embeddings.
We also take advantage of hyperbolic embeddings to capture the hierarchical relations among fine-grained class-associated token embeddings.
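Hyperbolic embeddings capture hierarchy because distance grows exponentially toward the boundary of the space, so broad and fine-grained classes can spread across levels. The sketch below computes distance in the Poincaré ball, the standard model for such embeddings; it is a generic illustration of the geometry, not the paper's specific mapping.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball model of hyperbolic space
    (points must have norm < 1). Generic sketch, not the paper's method."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / (denom + eps))

# Points near the boundary are far from the origin in hyperbolic terms,
# even when their Euclidean distance to the origin is modest.
origin = np.zeros(2)
near = np.array([0.1, 0.0])
far = np.array([0.9, 0.0])
print(poincare_distance(origin, near) < poincare_distance(origin, far))  # True
```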
arXiv Detail & Related papers (2023-02-13T09:15:00Z)
- SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation [36.80638177024504]
We propose a novel transformer-based framework, named Semantic Guided Activation Transformer (SemFormer) for WSSS.
We design a transformer-based Class-Aware AutoEncoder (CAAE) to extract the class embeddings for the input image and learn class semantics for all classes of the dataset.
Our SemFormer achieves 74.3% mIoU and surpasses many recent mainstream WSSS approaches by a large margin on the PASCAL VOC 2012 dataset.
arXiv Detail & Related papers (2022-10-26T10:51:20Z)
- Saliency Guided Inter- and Intra-Class Relation Constraints for Weakly Supervised Semantic Segmentation [66.87777732230884]
We propose a saliency-guided Inter- and Intra-Class Relation Constrained (I$^2$CRC) framework to assist the expansion of the activated object regions.
We also introduce an object-guided label refinement module that makes full use of both the segmentation prediction and the initial labels to obtain superior pseudo-labels.
arXiv Detail & Related papers (2022-06-20T03:40:56Z)
- Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment [31.759205815348658]
We propose a new method called Spatial-aware and Semantic-aware Token Alignment (SSTA) for cross-domain detection transformers.
For spatial-aware token alignment, we extract information from the cross-attention map (CAM) to align the distribution of tokens according to their attention to object queries.
For semantic-aware token alignment, we inject the category information into the cross-attention map and construct domain embedding to guide the learning of a multi-class discriminator.
arXiv Detail & Related papers (2022-06-01T04:13:22Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.