When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class
Medical Image Semantic Segmentation
- URL: http://arxiv.org/abs/2208.06449v2
- Date: Thu, 8 Feb 2024 22:55:52 GMT
- Title: When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class
Medical Image Semantic Segmentation
- Authors: Ziyang Wang, Tianze Li, Jian-Qing Zheng, Baoru Huang
- Abstract summary: In this paper, an advanced consistency-aware pseudo-label-based self-ensembling approach is presented.
Our framework consists of a feature-learning module which is enhanced by ViT and CNN mutually, and a guidance module which is robust for consistency-aware purposes.
Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set.
- Score: 13.911947592067678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the lack of quality annotations in the medical imaging
community, semi-supervised learning methods are highly valued in image semantic
segmentation tasks. In this paper, an advanced consistency-aware
pseudo-label-based self-ensembling approach is presented to fully utilize the
power of the Vision Transformer (ViT) and the Convolutional Neural Network
(CNN) in semi-supervised learning. Our proposed framework consists of a
feature-learning module, in which ViT and CNN mutually enhance each other, and
a guidance module, which is robust for consistency-aware purposes. In the
feature-learning module, pseudo labels are inferred recurrently and separately
by the CNN and ViT views to expand the data set, so that each view benefits
the other. Meanwhile, a perturbation scheme is designed for the
feature-learning module, and network-weight averaging is utilized to develop
the guidance module. By doing so, the framework combines the feature-learning
strengths of CNN and ViT, strengthens performance via dual-view co-training,
and enables consistency-aware supervision in a semi-supervised manner. A
topological exploration of all alternative supervision modes with CNN and ViT
is validated in detail, demonstrating the most promising performance and the
specific setting of our method on semi-supervised medical image segmentation
tasks. Experimental results show that the proposed method achieves
state-of-the-art performance on a public benchmark data set across a variety
of metrics. The code is publicly available.
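The abstract describes two mechanisms: cross-teaching, where the CNN and ViT views generate pseudo labels for each other on unlabeled images, and a guidance module obtained by averaging network weights (in the mean-teacher style). The following is a minimal, framework-free Python sketch of these two ideas only; `cnn_probs` and `vit_probs` are hypothetical per-pixel probabilities standing in for real network outputs, and the weight dictionaries stand in for full model parameters — this is an illustrative sketch, not the authors' implementation.

```python
def ema_update(teacher, student, alpha=0.99):
    """Guidance module: exponential moving average of student weights."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def pseudo_label(probs, threshold=0.5):
    """Binarize per-pixel foreground probabilities into a pseudo label map."""
    return [1 if p >= threshold else 0 for p in probs]

# Toy per-pixel probabilities from each view on an unlabeled image
cnn_probs = [0.9, 0.2, 0.7, 0.4]
vit_probs = [0.8, 0.3, 0.6, 0.1]

# Cross-teaching: each view's pseudo labels supervise the other view
labels_for_vit = pseudo_label(cnn_probs)  # CNN view -> targets for ViT
labels_for_cnn = pseudo_label(vit_probs)  # ViT view -> targets for CNN

# Guidance module: EMA of the student's weights (mean-teacher style)
student_w = {"w": 1.0, "b": 0.0}
teacher_w = {"w": 0.5, "b": 0.5}
teacher_w = ema_update(teacher_w, student_w)
```

In a real training loop, the cross-entropy loss between each network's prediction and the other view's pseudo labels would be added to the supervised loss, and the EMA teacher would provide consistency targets under input perturbations.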
Related papers
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation [11.637738540262797]
This paper introduces the Semi-Mamba-UNet, which integrates a visual mamba-based UNet architecture with a conventional UNet into a semi-supervised learning (SSL) framework.
Our comprehensive evaluation on a publicly available MRI cardiac segmentation dataset highlights the superior performance of Semi-Mamba-UNet.
arXiv Detail & Related papers (2024-02-11T17:09:21Z) - Multi-dimensional Fusion and Consistency for Semi-supervised Medical
Image Segmentation [10.628250457432499]
We introduce a novel semi-supervised learning framework tailored for medical image segmentation.
Central to our approach is the innovative Multi-scale Text-aware ViT-CNN Fusion scheme.
We propose the Multi-Axis Consistency framework for generating robust pseudo labels.
arXiv Detail & Related papers (2023-09-12T22:21:14Z) - R-Cut: Enhancing Explainability in Vision Transformers with Relationship
Weighted Out and Cut [14.382326829600283]
We introduce two modules: the "Relationship Weighted Out" module and the "Cut" module.
The "Cut" module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color.
We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset.
arXiv Detail & Related papers (2023-07-18T08:03:51Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - Semi-Supervised Cross-Modal Salient Object Detection with U-Structure
Networks [18.12933868289846]
We integrate the linguistic information into the vision-based U-Structure networks designed for salient object detection tasks.
We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features.
To reduce the heavy burden of labeling, we employ a semi-supervised learning method by training an image caption model.
arXiv Detail & Related papers (2022-08-08T18:39:37Z) - Deep Image Clustering with Contrastive Learning and Multi-scale Graph
Convolutional Networks [58.868899595936476]
This paper presents a new deep clustering approach termed image clustering with contrastive learning and multi-scale graph convolutional networks (IcicleGCN).
Experiments on multiple image datasets demonstrate the superior clustering performance of IcicleGCN over the state-of-the-art.
arXiv Detail & Related papers (2022-07-14T19:16:56Z) - Probing Inter-modality: Visual Parsing with Self-Attention for
Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.