Enhancing Representation in Medical Vision-Language Foundation Models
via Multi-Scale Information Extraction Techniques
- URL: http://arxiv.org/abs/2401.01583v2
- Date: Mon, 26 Feb 2024 10:35:03 GMT
- Title: Enhancing Representation in Medical Vision-Language Foundation Models
via Multi-Scale Information Extraction Techniques
- Authors: Weijian Huang, Cheng Li, Hong-Yu Zhou, Jiarun Liu, Hao Yang, Yong
Liang, Guangming Shi, Hairong Zheng, Shanshan Wang
- Abstract summary: We propose a method that effectively exploits multi-scale information to enhance the performance of medical foundation models.
We evaluate the effectiveness of the proposed method on six open-source datasets across different clinical tasks.
- Score: 41.078761802053535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of medical vision-language foundation models has attracted
significant attention in the field of medicine and healthcare due to their
promising prospects in various clinical applications. While previous studies
have commonly focused on feature learning at a single scale, investigation
into integrating multi-scale information is lacking, which may limit the
potential for mutual reinforcement among these features. This paper
aims to bridge this gap by proposing a method that effectively exploits
multi-scale information to enhance the performance of medical foundation
models. The proposed method simultaneously exploits features at the local,
instance, modality, and global levels, facilitating comprehensive
representation learning within the models. We evaluate the effectiveness of the
proposed method on six open-source datasets across different clinical tasks,
demonstrating its ability to enhance the performance of medical foundation
models.
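The multi-level idea in the abstract can be illustrated with a minimal sketch: a fine-grained (local) alignment score between image patch features and text token features, fused with a coarse (global) alignment of pooled features. All function names, the feature shapes, and the weighting scheme below are assumptions for illustration, not the authors' implementation; a full method would add instance- and modality-level objectives trained contrastively.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def local_alignment(patches, tokens):
    # patches: (P, d) image patch features; tokens: (T, d) text token features.
    # For each token, take its best-matching patch, then average (fine-grained score).
    sim = l2_normalize(tokens) @ l2_normalize(patches).T  # (T, P) cosine similarities
    return float(sim.max(axis=1).mean())

def global_alignment(patches, tokens):
    # Pool each modality into a single vector and compare (coarse-grained score).
    img = l2_normalize(patches.mean(axis=0))
    txt = l2_normalize(tokens.mean(axis=0))
    return float(img @ txt)

def multi_scale_score(patches, tokens, weights=(0.5, 0.5)):
    # Weighted fusion of the local and global alignment scores;
    # the equal weights here are an arbitrary illustrative choice.
    return (weights[0] * local_alignment(patches, tokens)
            + weights[1] * global_alignment(patches, tokens))
```

For perfectly matched features the fused score is 1.0, and it decreases as either the fine-grained or the pooled representations diverge, which is the intuition behind combining scales rather than relying on a single one.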
Related papers
- Exploration of Attention Mechanism-Enhanced Deep Learning Models in the Mining of Medical Textual Data [3.22071437711162]
The research explores the utilization of a deep learning model employing an attention mechanism in medical text mining.
It aims to enhance the model's capability to identify essential medical information by incorporating deep learning and attention mechanisms.
arXiv Detail & Related papers (2024-05-23T00:20:14Z)
- OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine [55.29668193415034]
We present OpenMEDLab, an open-source platform for multi-modality foundation models.
It encapsulates solutions of pioneering attempts in prompting and fine-tuning large language and vision models for frontline clinical and bioinformatic applications.
It opens access to a group of pre-trained foundation models for various medical image modalities, clinical text, protein engineering, etc.
arXiv Detail & Related papers (2024-02-28T03:51:02Z)
- Optimizing Skin Lesion Classification via Multimodal Data and Auxiliary Task Integration [54.76511683427566]
This research introduces a novel multimodal method for classifying skin lesions, integrating smartphone-captured images with essential clinical and demographic information.
A distinctive aspect of this method is the integration of an auxiliary task focused on super-resolution image prediction.
The experimental evaluations have been conducted using the PAD-UFES20 dataset, applying various deep-learning architectures.
arXiv Detail & Related papers (2024-02-16T05:16:20Z)
- Review of multimodal machine learning approaches in healthcare [0.0]
Clinicians rely on a variety of data sources to make informed decisions.
Recent advances in machine learning have facilitated the more efficient incorporation of multimodal data.
arXiv Detail & Related papers (2024-02-04T12:21:38Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects [2.1070612998322438]
The paper explores the transformative potential of multimodal models for clinical predictions.
Despite advancements, challenges such as data biases and the scarcity of "big data" in many biomedical domains persist.
arXiv Detail & Related papers (2023-11-04T05:42:51Z)
- Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision [6.2847894163744105]
Foundation models are large-scale, pre-trained deep-learning models adapted to a wide range of downstream tasks.
These models facilitate contextual reasoning, generalization, and prompt capabilities at test time.
Capitalizing on advances in computer vision, the medical imaging community has also shown growing interest in these models.
arXiv Detail & Related papers (2023-10-28T12:08:12Z)
- Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning [8.717599327516822]
MaCo is a novel multi-modal medical foundation model that explores masked contrastive learning to achieve granular alignment and zero-shot learning for a variety of medical imaging tasks.
We evaluate MaCo on six well-known open-source X-ray datasets, and the experimental results show it outperforms seven state-of-the-art approaches for classification, segmentation, and zero-shot phrase grounding.
arXiv Detail & Related papers (2023-09-12T01:29:37Z)
- Understanding the Tricks of Deep Learning in Medical Image Segmentation: Challenges and Future Directions [66.40971096248946]
In this paper, we collect a series of MedISeg tricks for different model implementation phases.
We experimentally explore the effectiveness of these tricks on consistent baselines.
We also open-source a strong MedISeg repository in which each component is plug-and-play.
arXiv Detail & Related papers (2022-09-21T12:30:05Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach that exploits structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model, enabling it to reason with knowledge as a supplement to the input image and text.
Third, we guide the model to emphasize the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.