When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class
Medical Image Semantic Segmentation
- URL: http://arxiv.org/abs/2208.06449v2
- Date: Thu, 8 Feb 2024 22:55:52 GMT
- Title: When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class
Medical Image Semantic Segmentation
- Authors: Ziyang Wang, Tianze Li, Jian-Qing Zheng, Baoru Huang
- Abstract summary: In this paper, an advanced consistency-aware pseudo-label-based self-ensembling approach is presented.
Our framework consists of a feature-learning module which is enhanced by ViT and CNN mutually, and a guidance module which is robust for consistency-aware purposes.
Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set.
- Score: 13.911947592067678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the lack of quality annotations in the medical imaging
community, semi-supervised learning methods are highly valued in image semantic
segmentation tasks. In this paper, an advanced consistency-aware
pseudo-label-based self-ensembling approach is presented to fully utilize the
power of the Vision Transformer (ViT) and the Convolutional Neural Network
(CNN) in semi-supervised learning. Our proposed framework consists of a
feature-learning module, in which ViT and CNN mutually enhance each other, and
a guidance module, which is robust for consistency-aware purposes. In the
feature-learning module, pseudo labels are inferred recurrently and separately
by the CNN and ViT views to expand the data set, so that each view benefits
the other. Meanwhile, a perturbation scheme is designed for the
feature-learning module, and network-weight averaging is utilized to develop
the guidance module. By doing so, the framework combines the feature-learning
strengths of CNN and ViT, strengthens performance via dual-view co-training,
and enables consistency-aware supervision in a semi-supervised manner. A
topological exploration of all alternative supervision modes with CNN and ViT
is validated in detail, demonstrating the most promising performance and the
specific setting of our method on semi-supervised medical image segmentation
tasks. Experimental results show that the proposed method achieves
state-of-the-art performance on a public benchmark data set across a variety
of metrics. The code is publicly available.
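The abstract describes two mechanisms: cross-teaching, where the CNN and ViT views generate pseudo labels for each other on unlabeled images, and a guidance module obtained by averaging network weights (in the mean-teacher style). The following is a minimal, framework-free Python sketch of these two ideas only; `cnn_probs` and `vit_probs` are hypothetical per-pixel probabilities standing in for real network outputs, and the weight dictionaries stand in for full model parameters — this is an illustrative sketch, not the authors' implementation.

```python
def ema_update(teacher, student, alpha=0.99):
    """Guidance module: exponential moving average of student weights."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def pseudo_label(probs, threshold=0.5):
    """Binarize per-pixel foreground probabilities into a pseudo label map."""
    return [1 if p >= threshold else 0 for p in probs]

# Toy per-pixel probabilities from each view on an unlabeled image
cnn_probs = [0.9, 0.2, 0.7, 0.4]
vit_probs = [0.8, 0.3, 0.6, 0.1]

# Cross-teaching: each view's pseudo labels supervise the other view
labels_for_vit = pseudo_label(cnn_probs)  # CNN view -> targets for ViT
labels_for_cnn = pseudo_label(vit_probs)  # ViT view -> targets for CNN

# Guidance module: EMA of the student's weights (mean-teacher style)
student_w = {"w": 1.0, "b": 0.0}
teacher_w = {"w": 0.5, "b": 0.5}
teacher_w = ema_update(teacher_w, student_w)
```

In a real training loop, the cross-entropy loss between each network's prediction and the other view's pseudo labels would be added to the supervised loss, and the EMA teacher would provide consistency targets under input perturbations.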
Related papers
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation [11.637738540262797]
This paper introduces the Semi-Mamba-UNet, which integrates a visual mamba-based UNet architecture with a conventional UNet into a semi-supervised learning (SSL) framework.
Our comprehensive evaluation on a publicly available MRI cardiac segmentation dataset highlights the superior performance of Semi-Mamba-UNet.
arXiv Detail & Related papers (2024-02-11T17:09:21Z) - Multi-dimensional Fusion and Consistency for Semi-supervised Medical
Image Segmentation [10.628250457432499]
We introduce a novel semi-supervised learning framework tailored for medical image segmentation.
Central to our approach is the innovative Multi-scale Text-aware ViT-CNN Fusion scheme.
We propose the Multi-Axis Consistency framework for generating robust pseudo labels.
arXiv Detail & Related papers (2023-09-12T22:21:14Z) - R-Cut: Enhancing Explainability in Vision Transformers with Relationship
Weighted Out and Cut [14.382326829600283]
We introduce two modules: the "Relationship Weighted Out" module and the "Cut" module.
The "Cut" module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color.
We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset.
arXiv Detail & Related papers (2023-07-18T08:03:51Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - Semi-Supervised Cross-Modal Salient Object Detection with U-Structure
Networks [18.12933868289846]
We integrate the linguistic information into the vision-based U-Structure networks designed for salient object detection tasks.
We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features.
To reduce the heavy burden of labeling, we employ a semi-supervised learning method by training an image caption model.
arXiv Detail & Related papers (2022-08-08T18:39:37Z) - Deep Image Clustering with Contrastive Learning and Multi-scale Graph
Convolutional Networks [58.868899595936476]
This paper presents a new deep clustering approach termed image clustering with contrastive learning and multi-scale graph convolutional networks (IcicleGCN).
Experiments on multiple image datasets demonstrate the superior clustering performance of IcicleGCN over the state-of-the-art.
arXiv Detail & Related papers (2022-07-14T19:16:56Z) - Probing Inter-modality: Visual Parsing with Self-Attention for
Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.