MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic
Segmentation
- URL: http://arxiv.org/abs/2308.03005v1
- Date: Sun, 6 Aug 2023 03:30:20 GMT
- Title: MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic
Segmentation
- Authors: Lian Xu, Mohammed Bennamoun, Farid Boussaid, Hamid Laga, Wanli Ouyang,
Dan Xu
- Abstract summary: We propose a transformer-based framework that aims to enhance weakly supervised semantic segmentation.
We introduce a Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens.
A Contrastive-Class-Token (CCT) module is proposed to enhance the learning of discriminative class tokens.
- Score: 90.73815426893034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel transformer-based framework that aims to enhance
weakly supervised semantic segmentation (WSSS) by generating accurate
class-specific object localization maps as pseudo labels. Building upon the
observation that the attended regions of the one-class token in the standard
vision transformer can contribute to a class-agnostic localization map, we
explore the potential of the transformer model to capture class-specific
attention for class-discriminative object localization by learning multiple
class tokens. We introduce a Multi-Class Token transformer, which incorporates
multiple class tokens to enable class-aware interactions with the patch tokens.
To achieve this, we devise a class-aware training strategy that establishes a
one-to-one correspondence between the output class tokens and the ground-truth
class labels. Moreover, a Contrastive-Class-Token (CCT) module is proposed to
enhance the learning of discriminative class tokens, enabling the model to
better capture the unique characteristics and properties of each class. As a
result, class-discriminative object localization maps can be effectively
generated by leveraging the class-to-patch attentions associated with different
class tokens. To further refine these localization maps, we propose the
utilization of patch-level pairwise affinity derived from the patch-to-patch
transformer attention. Furthermore, the proposed framework seamlessly
complements the Class Activation Mapping (CAM) method, resulting in
significantly improved WSSS performance on the PASCAL VOC 2012 and MS COCO 2014
datasets. These results underline the importance of the class token for WSSS.
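The abstract's two attention-based steps (reading class-specific maps from class-to-patch attention, then refining them with patch-to-patch affinity) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention matrix over C class tokens followed by N patch tokens is already available, and the function names, the toy matrix, and the simple matrix-product refinement are hypothetical choices made for illustration.

```python
# Illustrative sketch only; names, shapes, and the toy values are assumptions.

def split_attention(attn, num_classes):
    """Split a (C+N) x (C+N) token attention matrix into its
    class-to-patch (C x N) and patch-to-patch (N x N) blocks."""
    c = num_classes
    class_to_patch = [row[c:] for row in attn[:c]]
    patch_to_patch = [row[c:] for row in attn[c:]]
    return class_to_patch, patch_to_patch


def refine_with_affinity(class_to_patch, patch_to_patch):
    """Propagate each class map through the patch affinity:
    refined[k][i] = sum_j patch_to_patch[i][j] * class_to_patch[k][j]."""
    return [
        [sum(a * m for a, m in zip(aff_row, class_map)) for aff_row in patch_to_patch]
        for class_map in class_to_patch
    ]


def to_grid(flat_map, height, width):
    """Reshape a flat list of height * width patch scores into rows."""
    assert len(flat_map) == height * width
    return [flat_map[r * width:(r + 1) * width] for r in range(height)]


# Toy input: 2 class tokens and 4 patch tokens (a 2 x 2 patch grid).
# Columns are ordered cls0, cls1, p0, p1, p2, p3.
attn = [
    [0.0, 0.0, 0.8, 0.2, 0.0, 0.0],  # class token 0 attends to the top row
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.9],  # class token 1 attends to the bottom row
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],  # patch 0 is similar to patches 0 and 1
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],  # patch 1
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],  # patch 2 is similar to patches 2 and 3
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],  # patch 3
]

c2p, p2p = split_attention(attn, num_classes=2)
refined = refine_with_affinity(c2p, p2p)
maps = [to_grid(m, 2, 2) for m in refined]  # one 2 x 2 map per class
```

In this toy case the raw map for class 0 is concentrated on one patch, and the affinity step spreads it evenly over the two patches that attend to each other, which is the kind of smoothing the refinement is meant to provide.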
Related papers
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
Date: 2024-03-14T17:55:03Z
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
Date: 2024-03-02T10:03:21Z
- TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
Date: 2023-12-20T08:15:40Z
- Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings [19.997929884477628]
We explore the mechanism of class embeddings and find that more explicit and meaningful class embeddings can be generated purposely from class masks.
We propose ECENet, a new segmentation paradigm in which class embeddings are obtained and enhanced explicitly while interacting with multi-stage image features.
Our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on the PASCAL-Context dataset.
Date: 2023-08-24T16:16:10Z
- All-pairs Consistency Learning for Weakly Supervised Semantic Segmentation [42.66269050864235]
We propose a new transformer-based regularization to better localize objects for weakly supervised semantic segmentation (WSSS).
We adopt vision transformers, as the self-attention mechanism naturally embeds pair-wise affinity.
Our method produces noticeably better class localization maps (67.3% mIoU on PASCAL VOC train).
Date: 2023-08-08T15:14:23Z
- Distinguishability Calibration to In-Context Learning [31.375797763897104]
We propose a method to map a PLM-encoded embedding into a new metric space to guarantee the distinguishability of the resulting embeddings.
We also take advantage of hyperbolic embeddings to capture the hierarchical relations among fine-grained class-associated token embeddings.
Date: 2023-02-13T09:15:00Z
- Saliency Guided Inter- and Intra-Class Relation Constraints for Weakly Supervised Semantic Segmentation [66.87777732230884]
We propose a saliency guided Inter- and Intra-Class Relation Constrained (I$^2$CRC) framework to assist the expansion of the activated object regions.
We also introduce an object guided label refinement module to make full use of both the segmentation prediction and the initial labels for obtaining superior pseudo-labels.
Date: 2022-06-20T03:40:56Z
- Multi-class Token Transformer for Weakly Supervised Semantic Segmentation [94.78965643354285]
We propose a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS).
Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate if the transformer model can also effectively capture class-specific attention for more discriminative object localization.
The proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets.
Date: 2022-03-06T07:18:23Z
- Attribute Propagation Network for Graph Zero-shot Learning [57.68486382473194]
We introduce the attribute propagation network (APNet), which is composed of 1) a graph propagation model generating an attribute vector for each class and 2) a parameterized nearest neighbor (NN) classifier.
APNet achieves either compelling performance or new state-of-the-art results in experiments with two zero-shot learning settings and five benchmark datasets.
Date: 2020-09-24T16:53:40Z
This list is automatically generated from the titles and abstracts of the papers in this site.