BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation
- URL: http://arxiv.org/abs/2603.00156v1
- Date: Wed, 25 Feb 2026 18:11:47 GMT
- Title: BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation
- Authors: Saivan Talaei, Fatemeh Daneshfar, Abdulhady Abas Abdullah, Mustaqeem Khan,
- Abstract summary: BiCLIP is a framework engineered to bolster robustness in medical segmentation. It features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations. It exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
- Score: 3.7276397365086233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image segmentation is a cornerstone of computer-assisted diagnosis and treatment planning. While recent multimodal vision-language models have shown promise in enhancing semantic understanding through textual descriptions, their resilience in "in-the-wild" clinical settings, characterized by scarce annotations and hardware-induced image degradations, remains under-explored. We introduce BiCLIP (Bidirectional and Consistent Language-Image Processing), a framework engineered to bolster robustness in medical segmentation. BiCLIP features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations, ensuring superior semantic alignment. To further stabilize learning, we implement an augmentation consistency objective that regularizes intermediate representations against perturbed input views. Evaluation on the QaTa-COV19 and MosMedData+ benchmarks demonstrates that BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. Notably, BiCLIP maintains high performance when trained on as little as 1% of labeled data and exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
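The abstract names two concrete mechanisms: a bidirectional fusion step in which visual features refine text features and vice versa, and a consistency objective over perturbed views. Since the paper's code is not reproduced here, the following is a minimal PyTorch sketch of how such components could look; every name (`BidirectionalFusion`, `consistency_loss`, `d_model`, `n_heads`) is an illustrative assumption, not BiCLIP's actual implementation.

```python
# Minimal sketch of the two mechanisms the abstract describes. All names
# (BidirectionalFusion, consistency_loss, d_model) are illustrative
# assumptions; this is not BiCLIP's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """One fusion round: text attends to image features, then image attends
    to the refined text, so each modality iteratively refines the other."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.txt_from_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_from_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # Visual features refine the textual representation ...
        refined_txt, _ = self.txt_from_img(txt_tokens, img_tokens, img_tokens)
        txt_tokens = txt_tokens + refined_txt
        # ... and the refined text conditions the visual features in return.
        refined_img, _ = self.img_from_txt(img_tokens, txt_tokens, txt_tokens)
        img_tokens = img_tokens + refined_img
        return img_tokens, txt_tokens

def consistency_loss(feats_clean: torch.Tensor, feats_perturbed: torch.Tensor):
    """Augmentation consistency: intermediate features of a clean view and a
    perturbed view (e.g. simulated motion blur or CT noise) should agree."""
    return F.mse_loss(feats_perturbed, feats_clean.detach())
```

In a full training loop, the consistency term would presumably be added to the segmentation loss with a weighting hyperparameter; the abstract does not specify the perturbations or the weighting.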
Related papers
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation [8.913012426353154]
We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation.
Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens.
arXiv Detail & Related papers (2026-02-23T23:46:05Z) - Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization [32.47484883374212]
Trustworthy clinical summarization requires fluent generation and transparency about where each statement comes from.
We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images.
We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment.
arXiv Detail & Related papers (2026-01-23T02:01:43Z) - Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification [2.5995006632251516]
We present a unified vision-language framework tailored for ENT endoscopy image analysis.
It simultaneously tackles three clinically relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval.
We achieve 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96.
arXiv Detail & Related papers (2025-08-31T09:03:39Z) - Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation.
MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation [9.262045402495225]
BiPVL-Seg is an end-to-end framework that integrates vision-language fusion and embedding alignment.
BiPVL-Seg introduces progressive fusion in the architecture, which facilitates stage-wise information exchange between vision and text encoders.
It incorporates global-local contrastive alignment, a training objective that enhances the text encoder's comprehension by aligning text and vision embeddings at both class and concept levels.
arXiv Detail & Related papers (2025-03-30T17:34:39Z) - MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation [2.2585213273821716]
We introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans.
Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss.
We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further.
arXiv Detail & Related papers (2024-09-28T23:10:37Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - CLIP in Medical Imaging: A Survey [59.429714742927956]
Contrastive Language-Image Pre-training (CLIP) successfully introduces text supervision to vision models.
The use of CLIP has recently gained increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z) - C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation framework with a domain transfer network (C^2M-DoT).
C^2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Cross-level Contrastive Learning and Consistency Constraint for Semi-supervised Medical Image Segmentation [46.678279106837294]
We propose a cross-level contrastive learning scheme to enhance the representation capacity of local features in semi-supervised medical image segmentation.
With the help of cross-level contrastive learning and the consistency constraint, unlabelled data can be effectively exploited to improve segmentation performance; a minimal sketch of the shared contrastive primitive appears after this list.
arXiv Detail & Related papers (2022-02-08T15:12:11Z)
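As noted in the last entry above, several of these related works (e.g. BiPVL-Seg's global-local alignment and the cross-level contrastive scheme) build on the same CLIP-style contrastive primitive. Below is a minimal, self-contained sketch of that symmetric InfoNCE alignment loss; the function name and temperature value are illustrative assumptions and are not taken from any of the papers listed.

```python
# Minimal sketch of the CLIP-style symmetric InfoNCE loss that several of the
# related papers build on. The function name and temperature are illustrative
# assumptions, not drawn from any specific paper above.
import torch
import torch.nn.functional as F

def clip_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product is cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matched image-text pairs lie on the diagonal; penalize both directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```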
This list is automatically generated from the titles and abstracts of the papers on this site.