Related papers: MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

URL: http://arxiv.org/abs/2403.04626v2
Date: Fri, 31 May 2024 00:12:59 GMT
Title: MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder
Authors: Lei Li, Tianfang Zhang, Xinglin Zhang, Jiaqi Liu, Bingqi Ma, Yan Luo, Tao Chen
Abstract summary: We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for cross-domain zero-shot learning, which enhances the model's ability to learn from limited data. Lastly, we validate that using language improves zero-shot performance for medical image analysis.
Score: 26.830574964308962
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders (MAEs) and multimodal data. However, the impact of MAEs on intermodality remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for cross-domain zero-shot learning, which enhances the model's ability to learn from limited data, a common scenario in medical diagnostics. We verify that masking an image does not affect inter-modal learning. Furthermore, we propose the SVD loss to enhance representation learning for the characteristics of medical images, aiming to improve classification accuracy by leveraging the structural intricacies of such data. Our theory posits that masking encourages semantic preservation, robust feature extraction, regularization, domain adaptation, and invariance learning. Lastly, we validate that using language improves zero-shot performance for medical image analysis. MedFLIP's scaling of the masking process marks an advancement in the field, offering a pathway to rapid and precise medical image analysis without the traditional computational bottlenecks. Through experiments and validation, MedFLIP demonstrates efficient performance improvements, which can support future research and applications in medical diagnostics.
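As a rough illustration of the masking idea the abstract refers to, the sketch below shows generic MAE-style random patch masking: the image is split into tiles and only a random subset is kept for the encoder, which is where the speed-up comes from. This is not MedFLIP's implementation. The function names `patchify` and `random_patch_mask`, the 8x8 toy image, the patch size, and the 0.5 mask ratio are all assumptions chosen for illustration, and the SVD loss is not shown because the abstract does not define it.

```python
import random


def patchify(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return sorted indices of the patches that stay visible after random masking."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Keep at least one patch so the encoder always receives some input.
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


# Toy 8x8 "image" whose pixel value encodes its position (hypothetical data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = patchify(image, 4)                       # four 4x4 tiles
keep = random_patch_mask(len(tiles), 0.5, seed=0)
visible = [tiles[i] for i in keep]               # only these would reach the encoder
```

With a mask ratio of 0.5, the encoder in this toy setup would process two of the four tiles; higher ratios reduce the encoder's input further at the cost of less visible context.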

Related papers

Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models [15.530083855947987]
We propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR. Med-RwR actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models.
arXiv Detail & Related papers (2025-10-21T05:18:18Z)
Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis [2.6554246520306624]
Mask What Matters is a controllable text-guided masking framework for self-supervised medical image analysis. It consistently outperforms existing MIM methods, achieving gains of up to +3.1 percentage points in classification accuracy. It achieves these improvements with substantially lower overall masking ratios.
arXiv Detail & Related papers (2025-09-27T02:26:56Z)
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps. We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training. LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training [29.02600107837688]
This paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework to enhance pathological learning. First, we introduce the attention-masked image modeling (AttMIM) and entity-driven masked language modeling (EntMLM) modules. Second, our MMCLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts.
arXiv Detail & Related papers (2024-07-28T17:38:21Z)
Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions. VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
OPTiML: Dense Semantic Invariance Using Optimal Transport for Self-Supervised Medical Image Representation [6.4136876268620115]
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. We introduce a novel SSL framework OPTiML, employing optimal transport (OT), to capture the dense semantic invariance and fine-grained details. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
arXiv Detail & Related papers (2024-04-18T02:59:48Z)
Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning. Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge. Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
FeaInfNet: Diagnosis in Medical Image with Feature-Driven Inference and Visual Explanations [4.022446255159328]
Interpretable deep learning models have received widespread attention in the field of image recognition. Many interpretability models that have been proposed still have problems of insufficient accuracy and interpretability in medical image disease diagnosis. We propose feature-driven inference network (FeaInfNet) to solve these problems.
arXiv Detail & Related papers (2023-12-04T13:09:00Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering [7.2486693553383805]
Current Medical-VQA models learn cross-modal representations through residing vision and texture encoders in dual separate spaces. We propose UnICLAM, a Unified and Interpretable Medical-VQA model through Contrastive Representation Learning with Adversarial Masking. Experimental results on the VQA-RAD and SLAKE public benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models.
arXiv Detail & Related papers (2022-12-21T02:48:15Z)
Attentive Symmetric Autoencoder for Brain MRI Segmentation [56.02577247523737]
We propose a novel Attentive Symmetric Auto-encoder based on Vision Transformer (ViT) for 3D brain MRI segmentation tasks. In the pre-training stage, the proposed auto-encoder pays more attention to reconstructing the informative patches according to the gradient metrics. Experimental results show that our proposed attentive symmetric auto-encoder outperforms the state-of-the-art self-supervised learning methods and medical image segmentation models.
arXiv Detail & Related papers (2022-09-19T09:43:19Z)
Few-shot Medical Image Segmentation using a Global Correlation Network with Discriminative Embedding [60.89561661441736]
We propose a novel method for few-shot medical image segmentation. We construct our few-shot image segmentor using a deep convolutional network trained episodically. We enhance discriminability of deep embedding to encourage clustering of the feature domains of the same class.
arXiv Detail & Related papers (2020-12-10T04:01:07Z)
Medical Image Harmonization Using Deep Learning Based Canonical Mapping: Toward Robust and Generalizable Learning in Imaging [4.396671464565882]
We propose a new paradigm in which data from a diverse range of acquisition conditions are "harmonized" to a common reference domain. We test this approach on two example problems, namely MRI-based brain age prediction and classification of schizophrenia.
arXiv Detail & Related papers (2020-10-11T22:01:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.