Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- URL: http://arxiv.org/abs/2403.10153v3
- Date: Mon, 15 Jul 2024 14:35:13 GMT
- Title: Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- Authors: Yogesh Kumar, Pekka Marttinen
- Abstract summary: eCLIP is an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps.
It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap".
- Score: 8.06905122449317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
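The two mechanisms named in the abstract lend themselves to a compact illustration: the modality gap is commonly measured as the distance between the centroids of the L2-normalized image and text embedding clouds, and mixup blends a scarce expert-annotated sample with an ordinary one. Below is a minimal sketch under those assumptions, not the authors' implementation; the tensor shapes and the Beta mixing parameter are illustrative.

```python
import torch
import torch.nn.functional as F

def modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """Distance between the centroids of the two embedding clouds.
    Embeddings are L2-normalized first, as in the CLIP loss."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()

def mixup(x_expert: torch.Tensor, x_plain: torch.Tensor, alpha: float = 0.4):
    """Convex combination of an expert-annotated sample (e.g., a
    heatmap-masked image) and an unannotated one; lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x_expert + (1.0 - lam) * x_plain, lam

# Toy usage with random features standing in for encoder outputs.
img, txt = torch.randn(256, 512), torch.randn(256, 512)
print(f"modality gap: {modality_gap(img, txt):.3f}")
mixed, lam = mixup(img[0], img[1])
```

On random features the gap is near zero; in trained CLIP-style models the image and text clouds occupy separate regions of the hypersphere, which is the disparity the abstract refers to.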
Related papers
- Uncertainty-Aware Vision-Language Segmentation for Medical Imaging [12.545486211087791]
We introduce a novel uncertainty-aware multimodal segmentation framework for medical diagnosis.
We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion.
Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks.
arXiv Detail & Related papers (2026-02-16T06:27:51Z) - Multi-task Cross-modal Learning for Chest X-ray Image Retrieval [1.8648093673053043]
We propose a multi-task learning framework to fine-tune CLIP and BiomedCLIP for medical retrieval tasks.
We show that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks.
These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.
arXiv Detail & Related papers (2026-01-08T21:44:00Z) - TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation [56.09179939570486]
We propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations.
TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
arXiv Detail & Related papers (2025-12-24T12:06:26Z) - MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging [67.74482877175797]
MIRNet is a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning.
We introduce TongueAtlas-4K, a benchmark comprising 4,000 images annotated with 22 diagnostic labels.
arXiv Detail & Related papers (2025-11-13T06:30:41Z) - Generalized Contrastive Learning for Universal Multimodal Retrieval [53.70202081784898]
Cross-modal retrieval models (e.g., CLIP) show degraded performance when the retrieval keys are composed of a fused image-text modality.
This paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the need for new dataset curation.
arXiv Detail & Related papers (2025-09-30T01:25:04Z)
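The degradation GCL targets is easiest to see against the standard CLIP-style symmetric InfoNCE objective, which assumes one query modality and one key modality per batch. The sketch below shows only that baseline loss; GCL's actual generalization to fused image-text keys is not reproduced here, and the naive averaged fusion in the usage lines is purely illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.
    q: queries (e.g., images), shape (B, D); k: keys (e.g., texts), shape (B, D).
    Matching pairs share the same batch index."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(q.size(0))          # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Keys built by naively averaging image and text embeddings (illustration only);
# this is the fused-modality setting where plain CLIP retrieval degrades.
img, txt = torch.randn(32, 512), torch.randn(32, 512)
fused_keys = (img + txt) / 2
loss = clip_contrastive_loss(img, fused_keys)
```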
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation.
MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding [51.63264715941068]
NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization) is a novel cross-modality interaction VLM-based framework.
arXiv Detail & Related papers (2025-08-06T05:44:01Z) - Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings [44.99833362998488]
This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric language models for chest X-ray interpretation.
We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning.
Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities.
arXiv Detail & Related papers (2025-05-03T06:18:12Z) - MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation [4.760537994346813]
Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images.
We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion.
We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results.
arXiv Detail & Related papers (2025-04-29T01:26:02Z) - Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis [16.268045905735818]
CMSwinKAN is a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification.
We introduce a soft voting mechanism guided by clinical insights to seamlessly bridge patch-level predictions to whole slide image-level classifications.
Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets.
arXiv Detail & Related papers (2025-04-18T15:39:46Z)
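The soft-voting step in the CMSwinKAN entry above reduces to averaging patch-level class probabilities rather than hard labels. A minimal sketch follows; the optional per-patch weights stand in for the paper's clinically guided weighting, whose exact form is an assumption here.

```python
import numpy as np

def soft_vote(patch_probs, weights=None):
    """Aggregate patch-level class probabilities into one slide-level label.
    patch_probs: (num_patches, num_classes) softmax outputs per patch.
    weights: optional (num_patches,) importance weights, e.g., clinically guided."""
    patch_probs = np.asarray(patch_probs)
    if weights is None:
        weights = np.ones(len(patch_probs))
    weights = weights / weights.sum()      # normalize to a distribution
    slide_probs = weights @ patch_probs    # weighted mean over patches
    return int(np.argmax(slide_probs)), slide_probs

# Toy usage: three patches, two classes; the confident patch dominates.
label, probs = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]])
```

Unlike hard (majority) voting, soft voting preserves each patch's confidence, so one highly certain patch can outweigh several ambivalent ones.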
- Interactive Tumor Progression Modeling via Sketch-Based Image Editing [54.47725383502915]
We propose SkEditTumor, a sketch-based diffusion model for controllable tumor progression editing.
By leveraging sketches as structural priors, our method enables precise modifications of tumor regions while maintaining structural integrity and visual realism.
Our contributions include a novel integration of sketches with diffusion models for medical image editing, fine-grained control over tumor progression visualization, and extensive validation across multiple datasets, setting a new benchmark in the field.
arXiv Detail & Related papers (2025-03-10T00:04:19Z) - Task-oriented Uncertainty Collaborative Learning for Label-Efficient Brain Tumor Segmentation [6.722672686635773]
Multi-contrast magnetic resonance imaging (MRI) plays a vital role in brain tumor segmentation and diagnosis.
Existing methods still struggle to perceive multi-level specificity across different contrasts.
We propose a Task-oriented Uncertainty Collaborative Learning framework for multi-contrast MRI segmentation.
arXiv Detail & Related papers (2025-03-07T18:44:53Z) - Enhancing Multimodal Medical Image Classification using Cross-Graph Modal Contrastive Learning [5.660131312162423]
This paper proposes a novel Cross-Graph Modal Contrastive Learning (CGMCL) framework for multimodal medical image classification.
The proposed approach is evaluated on two datasets: a Parkinson's disease (PD) dataset and a public melanoma dataset.
Results demonstrate that CGMCL outperforms conventional unimodal methods in accuracy, interpretability, and early disease prediction.
arXiv Detail & Related papers (2024-10-23T01:25:25Z) - MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report [4.340464264725625]
We introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports.
We utilize LoRA-PEFT to significantly reduce the trainable parameters of the LLM and incorporate a recent linear-attention dropping strategy in the Vision Transformer (ViT) for smoother attention.
To the best of our knowledge, we are the first to propose an integrated model that combines X-rays, ECGs, and radiology/cardiology reports with this approach.
arXiv Detail & Related papers (2024-10-21T17:42:41Z)
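The parameter saving MoRE attributes to LoRA comes from freezing a pretrained weight matrix and learning only a low-rank additive update. Below is a generic sketch of a LoRA-wrapped linear layer, not the MoRE code nor the PEFT library API; the rank and scaling values are arbitrary choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update.
    Effective weight: W + (alpha / r) * B @ A, with A (r x in) and B (out x r),
    so trainable parameters drop from in*out to r*(in + out)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Toy usage: a 4096x4096 layer has ~16.8M frozen weights; rank-8 LoRA trains only 65,536.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```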
- C-MELT: Contrastive Enhanced Masked Auto-Encoders for ECG-Language Pre-Training [10.088785685439134]
We propose C-MELT, a framework that pre-trains on ECG and text data using a contrastive masked auto-encoder architecture.
C-MELT uniquely combines the strengths of generative modeling with enhanced discriminative capabilities to achieve robust cross-modal representations.
arXiv Detail & Related papers (2024-10-03T01:24:09Z) - Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images [1.4680035572775536]
Vision-language models have emerged as a powerful tool for challenging multi-modal classification problems in the medical domain.
Existing research has focused on clinical descriptions for specific modalities or body regions, leaving a gap for a model that provides whole-body multi-modal descriptions.
In this paper, we address this gap by automating the generation of standardized body station(s) and list of organ(s) across the whole body in multi-modal MR and CT radiological images.
arXiv Detail & Related papers (2024-05-31T09:59:11Z) - SELECTOR: Heterogeneous graph network with convolutional masked autoencoder for multimodal robust prediction of cancer survival [8.403756148610269]
Multimodal prediction of cancer patient survival offers a more comprehensive and precise approach.
This paper introduces SELECTOR, a heterogeneous graph-aware network based on convolutional masked autoencoders.
Our method significantly outperforms state-of-the-art methods in both modality-missing and intra-modality information-confirmed cases.
arXiv Detail & Related papers (2024-03-14T11:23:39Z) - XAI for In-hospital Mortality Prediction via Multimodal ICU Data [57.73357047856416]
We propose an efficient, explainable AI solution for predicting in-hospital mortality via multimodal ICU data.
We employ multimodal learning in our framework, which can receive heterogeneous inputs from clinical data and make decisions.
Our framework can be easily transferred to other clinical tasks, which facilitates the discovery of crucial factors in healthcare research.
arXiv Detail & Related papers (2023-12-29T14:28:04Z) - Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis.
We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image segmentation (DEC-Seg).
arXiv Detail & Related papers (2023-12-26T12:56:31Z) - An Empirical Study of CLIP for Text-based Person Search [51.94743973155648]
Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions.
Contrastive Language-Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably across various cross-modal downstream tasks.
This paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS tasks.
arXiv Detail & Related papers (2023-08-19T15:08:10Z) - Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z) - Ambiguous Medical Image Segmentation using Diffusion Models [60.378180265885945]
We introduce a single diffusion model-based approach that produces multiple plausible outputs by learning a distribution over group insights.
Our proposed model generates a distribution of segmentation masks by leveraging the inherent sampling process of diffusion.
Comprehensive results show that our proposed approach outperforms existing state-of-the-art ambiguous segmentation networks.
arXiv Detail & Related papers (2023-04-10T17:58:22Z)
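The move in the ambiguous-segmentation entry above is to treat the diffusion sampler as a generator of multiple plausible masks and then summarize their agreement. A minimal sketch follows; `sample_mask` is a hypothetical stand-in for one reverse-diffusion pass of a trained model, replaced here by random noise so the snippet runs.

```python
import numpy as np

def sample_mask(image, rng):
    """Hypothetical stand-in for one reverse-diffusion sample of a binary mask.
    A real model would run its learned denoising chain; here we draw noise."""
    return (rng.random(image.shape) > 0.5).astype(np.float32)

def segment_with_uncertainty(image, n_samples=16, seed=0):
    """Draw several plausible masks and summarize their agreement."""
    rng = np.random.default_rng(seed)
    masks = np.stack([sample_mask(image, rng) for _ in range(n_samples)])
    mean_mask = masks.mean(axis=0)             # per-pixel foreground frequency
    consensus = mean_mask > 0.5                # majority-vote segmentation
    uncertainty = mean_mask * (1 - mean_mask)  # largest where samples disagree
    return consensus, uncertainty

consensus, uncertainty = segment_with_uncertainty(np.zeros((64, 64)))
```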
- Competence-based Multimodal Curriculum Learning for Medical Report Generation [98.10763792453925]
We propose a Competence-based Multimodal Curriculum Learning framework (CMCL) to alleviate data bias and make the best use of available data.
Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.
arXiv Detail & Related papers (2022-06-24T08:16:01Z)
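CMCL's step-by-step optimization is a curriculum: at each training step the model samples only examples whose estimated difficulty falls below its current competence. The sketch below shows a standard competence-based formulation of that idea, not the authors' exact criteria; the per-example difficulty scores are assumed to be given (e.g., report length or rarity of findings).

```python
import random

def competence(step, total_steps, c0=0.1):
    """Competence grows from c0 to 1 over training (square-root schedule)."""
    t = min(step / total_steps, 1.0)
    return min(1.0, (c0 ** 2 + (1.0 - c0 ** 2) * t) ** 0.5)

def curriculum_batch(examples, difficulties, step, total_steps, batch_size=8):
    """Sample a batch only from examples the model is currently 'competent' for.
    difficulties: per-example scores normalized to [0, 1]."""
    c = competence(step, total_steps)
    eligible = [ex for ex, d in zip(examples, difficulties) if d <= c]
    return random.sample(eligible, min(batch_size, len(eligible)))

# Toy usage: early steps see only the easiest reports; late steps see everything.
data = list(range(100))
diff = [i / 99 for i in data]
early = curriculum_batch(data, diff, step=10, total_steps=1000)
late = curriculum_batch(data, diff, step=900, total_steps=1000)
```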
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.