A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation
- URL: http://arxiv.org/abs/2512.00084v1
- Date: Wed, 26 Nov 2025 06:57:11 GMT
- Title: A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation
- Authors: Venkata Siddharth Dhara, Pawan Kumar
- Abstract summary: We propose FastTextDiff, a label-efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations. Our approach uses ModernBERT, a transformer capable of processing long clinical notes, to tightly link textual annotations with semantic content in medical images. By replacing Clinical BioBERT with ModernBERT, FastTextDiff benefits from FlashAttention 2, an alternating attention mechanism, and a 2-trillion-token corpus.
- Score: 1.1348379236860462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent times, denoising diffusion probabilistic models (DPMs) have proven effective for medical image generation and denoising, and as representation learners for downstream segmentation. However, segmentation performance is limited by the need for dense pixel-wise labels, which are expensive, time-consuming, and require expert knowledge. We propose FastTextDiff, a label-efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations. Our approach uses ModernBERT, a transformer capable of processing long clinical notes, to tightly link textual annotations with semantic content in medical images. Trained on MIMIC-III and MIMIC-IV, ModernBERT encodes clinical knowledge that guides cross-modal attention between visual and textual features. This study validates ModernBERT as a fast, scalable alternative to Clinical BioBERT in diffusion-based segmentation pipelines and highlights the promise of multi-modal techniques for medical image analysis. By replacing Clinical BioBERT with ModernBERT, FastTextDiff benefits from FlashAttention 2, an alternating attention mechanism, and a 2-trillion-token corpus, improving both segmentation accuracy and training efficiency over traditional diffusion-based models.
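The abstract mentions cross-modal attention between visual and textual features but gives no implementation details. As a rough illustration only, the following is a minimal sketch of generic single-head scaled dot-product cross-attention in which visual tokens act as queries and text tokens supply keys and values. It is not FastTextDiff's actual architecture; the function names, dimensions, and random projection matrices are all assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, text, w_q, w_k, w_v):
    """Generic cross-attention: each visual token attends over all text tokens.

    visual: (n_visual, d_visual), text: (n_text, d_text).
    w_q, w_k, w_v project both modalities into a shared d-dimensional space.
    """
    q = matmul(visual, w_q)                    # (n_visual, d)
    k = matmul(text, w_k)                      # (n_text, d)
    v = matmul(text, w_v)                      # (n_text, d)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]       # (d, n_text)
    scores = matmul(q, k_t)                    # (n_visual, n_text)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights         # fused: (n_visual, d)


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
visual = random_matrix(4, 8, rng)   # 4 assumed image patches, 8-dim features
text = random_matrix(3, 6, rng)     # 3 assumed text tokens, 6-dim embeddings
w_q = random_matrix(8, 5, rng)
w_k = random_matrix(6, 5, rng)
w_v = random_matrix(6, 5, rng)

fused, weights = cross_attention(visual, text, w_q, w_k, w_v)
print(len(fused), len(fused[0]))    # one fused vector per visual token
```

Each row of `weights` sums to 1 and indicates how strongly a given visual token draws on each text token. In a real pipeline the inputs would come from an image encoder and a text encoder such as ModernBERT, and the projections would be learned.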
Related papers
- BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation [3.7276397365086233]
BiCLIP is a framework engineered to bolster robustness in medical segmentation.
It features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations.
It exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
arXiv Detail & Related papers (2026-02-25T18:11:47Z) - MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation [8.913012426353154]
We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation.
Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens.
arXiv Detail & Related papers (2026-02-23T23:46:05Z) - MedCondDiff: Lightweight, Robust, Semantically Guided Diffusion for Medical Image Segmentation [5.838464931565891]
We introduce MedCondDiff, a diffusion-based framework for medical image segmentation.
The model conditions the denoising process on semantic priors extracted by a Pyramid Vision Transformer (PVT) backbone.
This design improves robustness while reducing both inference time and VRAM usage.
arXiv Detail & Related papers (2025-11-29T06:43:15Z) - Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model [5.158113225132093]
Semi-supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation.
Existing methods often struggle to structure semantic distributions in the latent space due to noise introduced by pseudo-labels.
Our method introduces a constraint into the latent structure of semantic labels during the denoising diffusion process by enforcing prototype-based contrastive consistency.
arXiv Detail & Related papers (2025-07-22T10:21:55Z) - SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging [12.707029435622953]
This paper introduces SegDT, a new segmentation model based on the diffusion transformer (DiT).
SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves generation quality with fewer inference steps.
This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals.
arXiv Detail & Related papers (2025-07-21T13:18:05Z) - PathSegDiff: Pathology Segmentation using Diffusion model representations [63.20694440934692]
We propose PathSegDiff, a novel approach for histopathology image segmentation that leverages Latent Diffusion Models (LDMs) as pre-trained feature extractors.
Our method utilizes a pathology-specific LDM, guided by a self-supervised encoder, to extract rich semantic information from H&E stained histopathology images.
Our experiments demonstrate significant improvements over traditional methods on the BCSS and GlaS datasets.
arXiv Detail & Related papers (2025-04-09T14:58:21Z) - PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z) - Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models [5.865983529245793]
TextDiff improves semantic representation through inexpensive medical text annotations.
We show that TextDiff is significantly superior to the state-of-the-art multi-modal segmentation methods with only a few training samples.
arXiv Detail & Related papers (2024-07-07T10:21:08Z) - Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective based on forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z) - MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore MAEs for zero-shot learning with crossed domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that models can be adapted to downstream visual perception tasks faster using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.