When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation
- URL: http://arxiv.org/abs/2208.06449v2
- Date: Thu, 8 Feb 2024 22:55:52 GMT
- Title: When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation
- Authors: Ziyang Wang, Tianze Li, Jian-Qing Zheng, Baoru Huang
- Abstract summary: In this paper, an advanced consistency-aware pseudo-label-based self-ensembling approach is presented.
Our framework consists of a feature-learning module, in which ViT and CNN mutually enhance each other, and a guidance module that provides robust consistency-aware supervision.
Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set.
- Score: 13.911947592067678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the lack of quality annotations in the medical imaging community, semi-supervised learning methods are highly valued in image semantic segmentation tasks. In this paper, an advanced consistency-aware pseudo-label-based self-ensembling approach is presented to fully utilize the power of the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) in semi-supervised learning. Our proposed framework consists of a feature-learning module, in which ViT and CNN mutually enhance each other, and a guidance module that provides robust consistency-aware supervision. In the feature-learning module, pseudo labels are inferred and utilized recurrently and separately by the CNN and ViT views to expand the data set, so that each view benefits from the other. Meanwhile, a perturbation scheme is designed for the feature-learning module, and network weight averaging is used to build the guidance module. By doing so, the framework combines the feature-learning strengths of CNN and ViT, strengthens performance via dual-view co-training, and enables consistency-aware supervision in a semi-supervised manner. A topological exploration of all alternative supervision modes between CNN and ViT is validated in detail, identifying the most promising performance and the specific setting of our method on semi-supervised medical image segmentation tasks. Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set across a variety of metrics. The code is publicly available.
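To make the training scheme described in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch of dual-view co-training with cross pseudo-labels and a weight-averaged guidance network. It is not the authors' released implementation: the model handles, the Gaussian input perturbation, the EMA coefficient, the loss weight `lam`, and the choice to average the CNN view's weights into the guidance module are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of CNN/ViT dual-view co-training with
# cross pseudo-label supervision and an EMA-averaged guidance network.
import torch
import torch.nn.functional as F

def ema_update(guidance, student, alpha=0.99):
    """Average network weights into the guidance module (mean-teacher style)."""
    for g, s in zip(guidance.parameters(), student.parameters()):
        g.data.mul_(alpha).add_(s.data, alpha=1.0 - alpha)

def train_step(cnn, vit, guidance, opt, labeled, unlabeled, lam=0.5):
    x_l, y_l = labeled                   # annotated images (N,C,H,W) and masks (N,H,W)
    x_u = unlabeled                      # unlabeled images
    x_u_pert = x_u + 0.1 * torch.randn_like(x_u)   # assumed perturbation scheme

    # Supervised loss on the small labeled set for both views.
    sup = F.cross_entropy(cnn(x_l), y_l) + F.cross_entropy(vit(x_l), y_l)

    # Each view infers pseudo labels that supervise the other view (co-training).
    with torch.no_grad():
        pseudo_cnn = cnn(x_u).argmax(dim=1)
        pseudo_vit = vit(x_u).argmax(dim=1)
    cross = F.cross_entropy(cnn(x_u), pseudo_vit) + F.cross_entropy(vit(x_u), pseudo_cnn)

    # Consistency-aware guidance: predictions on perturbed inputs should agree
    # with the weight-averaged guidance network's predictions.
    with torch.no_grad():
        target = guidance(x_u).softmax(dim=1)
    cons = F.mse_loss(cnn(x_u_pert).softmax(dim=1), target) + \
           F.mse_loss(vit(x_u_pert).softmax(dim=1), target)

    loss = sup + lam * (cross + cons)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(guidance, cnn)            # assumption: guidance tracks the CNN view
    return loss.item()
```

Which network supervises which (CNN to ViT, ViT to CNN, guidance to both, and so on) is exactly the supervision topology the abstract says is explored; the wiring above is only one plausible configuration.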
Related papers
- Intrapartum Ultrasound Image Segmentation of Pubic Symphysis and Fetal Head Using Dual Student-Teacher Framework with CNN-ViT Collaborative Learning [1.5233179662962222]
The segmentation of the pubic symphysis and fetal head (PSFH) constitutes a pivotal step in monitoring labor progression and identifying potential delivery complications.
Traditional semi-supervised learning approaches primarily utilize a unified network model based on Convolutional Neural Networks (CNNs).
We introduce a novel framework, the Dual-Student and Teacher Combining CNN and Transformer (DSTCT).
arXiv Detail & Related papers (2024-09-11T00:57:31Z)
- PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z)
- Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation [11.637738540262797]
This study introduces Semi-Mamba-UNet, which integrates a purely visual Mamba-based encoder-decoder architecture with a conventional CNN-based UNet into a semi-supervised learning framework.
This innovative SSL approach leverages both networks to generate pseudo-labels and cross-supervise one another at the pixel level simultaneously.
We introduce a self-supervised pixel-level contrastive learning strategy that employs a pair of projectors to further enhance feature learning; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-02-11T17:09:21Z)
- Multi-dimensional Fusion and Consistency for Semi-supervised Medical Image Segmentation [10.628250457432499]
We introduce a novel semi-supervised learning framework tailored for medical image segmentation.
Central to our approach is the innovative Multi-scale Text-aware ViT-CNN Fusion scheme.
We propose the Multi-Axis Consistency framework for generating robust pseudo labels.
arXiv Detail & Related papers (2023-09-12T22:21:14Z)
- R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut [14.382326829600283]
We introduce two modules: the "Relationship Weighted Out" and the "Cut" modules.
The "Cut" module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color.
We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset.
arXiv Detail & Related papers (2023-07-18T08:03:51Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Deep Image Clustering with Contrastive Learning and Multi-scale Graph Convolutional Networks [58.868899595936476]
This paper presents a new deep clustering approach termed image clustering with contrastive learning and multi-scale graph convolutional networks (IcicleGCN).
Experiments on multiple image datasets demonstrate the superior clustering performance of IcicleGCN over the state-of-the-art.
arXiv Detail & Related papers (2022-07-14T19:16:56Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
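As a side note on the pixel-level contrastive strategy mentioned in the Semi-Mamba-UNet entry above, one common reading is a pixel-wise InfoNCE objective computed between two projected feature maps. The sketch below is only an interpretation under that assumption; the projector design, temperature, and pixel sampling are illustrative choices, not values from that paper.

```python
# Minimal sketch (an interpretation, not the Semi-Mamba-UNet code) of a
# pixel-level contrastive loss between two projected feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelProjector(nn.Module):
    """1x1-convolution projection head; the output width (64) is an assumed value."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1),
        )

    def forward(self, x):
        return F.normalize(self.proj(x), dim=1)   # unit-length pixel embeddings

def pixel_contrastive_loss(z1, z2, tau=0.1, n_pix=256):
    """InfoNCE over a random subset of matching pixel locations in two views."""
    b, c, h, w = z1.shape
    z1 = z1.flatten(2).transpose(1, 2).reshape(-1, c)   # (B*H*W, C)
    z2 = z2.flatten(2).transpose(1, 2).reshape(-1, c)
    idx = torch.randperm(z1.size(0), device=z1.device)[:n_pix]  # subsample pixels
    anchors, positives = z1[idx], z2[idx]
    logits = anchors @ positives.t() / tau               # positives sit on the diagonal
    labels = torch.arange(anchors.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```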
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.