Related papers: Advancing Medical Representation Learning Through High-Quality Data

Advancing Medical Representation Learning Through High-Quality Data

URL: http://arxiv.org/abs/2503.14377v1
Date: Tue, 18 Mar 2025 16:10:11 GMT
Title: Advancing Medical Representation Learning Through High-Quality Data
Authors: Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar, Franklin Ogidi, Shuvendu Roy, Sajad Ashkezari, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Arash Afkanpour, Elham Dolatabadi,
Abstract summary: We introduce Open-PMC, a high-quality medical dataset from PubMed Central.<n>In-text references provide richer medical context, extending beyond the abstract information typically found in captions.<n>We benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks.
Score: 14.522284057070395
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality-not just size-drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

Related papers

MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data [32.65971100171597]
We introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data.<n>We also present MedGround-35K, a novel multimodal medical dataset.
arXiv Detail & Related papers (2026-01-11T10:34:18Z)
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation [13.362188283113788]
Vision-language pretraining has emerged as a powerful paradigm in medical image analysis.<n>We propose a novel framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining.
arXiv Detail & Related papers (2025-12-03T04:55:54Z)
Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning [0.03214166687856062]
We introduce a scalable subfigure extraction pipeline based on transformer-based object detection.<n>We release OPEN-PMC-18M, a large-scale high quality biomedical vision-language dataset.<n>We show improved performance across retrieval, zero-shot classification, and robustness benchmarks.
arXiv Detail & Related papers (2025-06-03T10:53:19Z)
In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review [18.178774133733686]
We propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications.<n>We discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle.
arXiv Detail & Related papers (2025-01-18T11:03:59Z)
MRGen: Segmentation Data Engine For Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically significant imaging modalities is challenging due to the scarcity of annotated data.<n>This paper investigates leveraging generative models to synthesize training data, to train segmentation models for underrepresented modalities.
arXiv Detail & Related papers (2024-12-04T16:34:22Z)
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training. LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.<n>We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.<n>Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap. We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities. We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z)
Building RadiologyNET: Unsupervised annotation of a large-scale multimodal medical database [0.4915744683251151]
The usage of machine learning in medical diagnosis and treatment has witnessed significant growth in recent years. However, the availability of large annotated image datasets remains a major obstacle since the process of annotation is time-consuming and costly. This paper explores how to automatically annotate a database of medical radiology images with regard to their semantic similarity.
arXiv Detail & Related papers (2023-07-27T13:00:33Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.