MECPformer: Multi-estimations Complementary Patch with CNN-Transformers
for Weakly Supervised Semantic Segmentation
- URL: http://arxiv.org/abs/2303.10689v1
- Date: Sun, 19 Mar 2023 15:42:45 GMT
- Authors: Chunmeng Liu, Guangyao Li, Yao Shen, Ruiqi Wang
- Abstract summary: We propose a simple yet effective method with a Multi-estimations Complementary Patch (MECP) strategy and an Adaptive Conflict Module (ACM).
In addition, ACM adaptively removes conflicting pixels and exploits the network's self-training capability to mine potential target information.
Our MECPformer reaches a new state-of-the-art 72.0% mIoU on PASCAL VOC 2012 and 42.4% on the MS COCO 2014 dataset.
- Score: 8.975330500836057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The initial seed based on the convolutional neural network (CNN) for weakly
supervised semantic segmentation always highlights the most discriminative
regions but fails to identify the global target information. Transformer-based
methods have since been proposed, benefiting from their ability to capture
long-range feature representations. However, we observe a flaw despite these
advantages: given a class, the initial seeds generated by a transformer may
invade regions belonging to other classes. Motivated by these issues, we devise
a simple yet effective
method with Multi-estimations Complementary Patch (MECP) strategy and Adaptive
Conflict Module (ACM), dubbed MECPformer. Given an image, we manipulate it with
the MECP strategy at different epochs, and the network mines and deeply fuses
the semantic information at different levels. In addition, ACM adaptively
removes conflicting pixels and exploits the network self-training capability to
mine potential target information. Without bells and whistles, our MECPformer
reaches a new state-of-the-art 72.0% mIoU on PASCAL VOC 2012 and 42.4% on the
MS COCO 2014 dataset. The code is available at
https://github.com/ChunmengLiu1/MECPformer.
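The mIoU figures quoted in the abstract are mean intersection-over-union scores. As a minimal sketch of the metric only (not the authors' evaluation code), the following pure-Python function computes per-class IoU for a single predicted label map and averages over the classes that occur. The function name `mean_iou`, the nested-list input format, and the ignore label 255 are assumptions made for illustration; benchmark protocols typically accumulate counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in `pred` or `target`.

    `pred` and `target` are 2D lists of integer class labels with the
    same shape. Pixels whose target equals `ignore_index` are skipped
    (an assumed convention for unlabeled pixels).
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if t == ignore_index:
                continue
            if p == t:
                # Correct pixel: counts once in both intersection and union.
                inter[p] += 1
                union[p] += 1
            else:
                # Wrong pixel: enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x2 example with two classes.
pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their average, 7/12.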
Related papers
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction.
arXiv Detail & Related papers (2023-09-09T02:18:17Z)
- SwinV2DNet: Pyramid and Self-Supervision Compounded Feature Learning for Remote Sensing Images Change Detection [12.727650696327878]
We propose an end-to-end compounded dense network SwinV2DNet to inherit advantages of transformer and CNN.
It captures the change relationship features through the densely connected Swin V2 backbone.
It provides the low-level pre-changed and post-changed features through a CNN branch.
arXiv Detail & Related papers (2023-08-22T03:31:52Z)
- Focal-UNet: UNet-like Focal Modulation for Medical Image Segmentation [8.75217589103206]
We propose a new U-shaped architecture for medical image segmentation with the help of the newly introduced focal modulation mechanism.
Due to the ability of the focal module to aggregate local and global features, our model could simultaneously benefit from the wide receptive field of transformers.
arXiv Detail & Related papers (2022-12-19T06:17:22Z)
- Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation [0.0]
This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM.
Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve 69.3% mIoU on the PASCAL VOC 2012 val set.
arXiv Detail & Related papers (2022-10-31T15:32:23Z)
- Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, the need for Transformers to incorporate contextual information to extract features dynamically has been neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation [19.333543299407832]
We propose TransCAM, a Conformer-based solution to weakly supervised semantic segmentation.
We show that TransCAM achieves a new state-of-the-art performance of 69.3% and 69.6% on the respective PASCAL VOC 2012 validation and test sets.
arXiv Detail & Related papers (2022-03-14T16:17:18Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for high-performance image-based person Re-ID.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.