MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives
- URL: http://arxiv.org/abs/2501.04184v2
- Date: Mon, 13 Jan 2025 03:33:36 GMT
- Title: MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives
- Authors: Wisdom O. Ikezogwo, Kevin Zhang, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Linda Shapiro, Ranjay Krishna
- Abstract summary: MedicalNarratives is a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies.
Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes.
To evaluate the utility of MedicalNarratives, we train GenMedClip based on the CLIP architecture using our dataset spanning 12 medical domains.
- Score: 11.242775987217032
- License:
- Abstract: We propose MedicalNarratives, a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives, which collects grounded image-text data by curating instructors' speech and mouse cursor movements synchronized in time. MedicalNarratives enables pretraining of both semantic and dense objectives, alleviating the need to train medical semantic and dense tasks disparately due to the lack of reasonably sized datasets. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedClip based on the CLIP architecture using our dataset spanning 12 medical domains and demonstrate that it outperforms previous state-of-the-art models on a newly constructed medical imaging benchmark that comprehensively evaluates performance across all modalities. Data, demo, code and models available at https://medical-narratives.github.io
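The two ingredients described in the abstract, narration text grounded by time-synchronized cursor traces and CLIP-style contrastive pretraining for GenMedClip, can be illustrated with a minimal sketch. Everything below is illustrative only: the sample fields, the `align_trace_to_segment` helper, and the toy embedding sizes are assumptions, not the released MedicalNarratives schema or the GenMedClip training code.

```python
# Illustrative sketch only: field names, helpers, and dimensions are assumed,
# not the released MedicalNarratives data format or GenMedClip implementation.
from dataclasses import dataclass
from typing import List, Tuple

import torch
import torch.nn.functional as F


@dataclass
class GroundedSample:
    """One image-text pair with an optional dense annotation (cursor trace)."""
    image_path: str                           # video frame or article figure
    caption: str                              # narration transcribed for this segment
    trace: List[Tuple[float, float, float]]   # (t_seconds, x_norm, y_norm) cursor points


def align_trace_to_segment(trace, t_start, t_end):
    """Keep cursor points whose timestamps fall inside a narration segment,
    mirroring the paper's idea of synchronizing speech and mouse movements in time."""
    return [(t, x, y) for (t, x, y) in trace if t_start <= t <= t_end]


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings,
    the standard objective for CLIP-style models such as GenMedClip."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy example: a narration segment from 12.0s to 15.5s and its cursor trace.
    sample = GroundedSample(
        image_path="frame_00123.png",
        caption="note the hypoechoic lesion in the left lobe",
        trace=[(11.8, 0.42, 0.51), (12.6, 0.44, 0.53), (14.9, 0.47, 0.55), (16.1, 0.50, 0.58)],
    )
    print(align_trace_to_segment(sample.trace, 12.0, 15.5))  # two points fall in the segment

    # Toy embeddings standing in for GenMedClip's image and text encoders.
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```

In this reading, the semantic objective comes from the image-caption pairs, while the time-aligned traces supply the dense supervision (traces and bounding boxes) mentioned in the abstract.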
Related papers
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles.
BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- MedContext: Learning Contextual Cues for Efficient Volumetric Medical Segmentation [25.74088298769155]
We propose a universal training framework called MedContext for 3D medical segmentation.
Our approach effectively learns self-supervised contextual cues jointly with the supervised voxel segmentation task.
The effectiveness of MedContext is validated across multiple 3D medical datasets and four state-of-the-art model architectures.
arXiv Detail & Related papers (2024-02-27T17:58:05Z)
- HICH Image/Text (HICH-IT): Comprehensive Text and Image Datasets for Hypertensive Intracerebral Hemorrhage Research [12.479936404475803]
We introduce a new dataset in the medical field of hypertensive intracerebral hemorrhage (HICH) called HICH-IT.
This dataset is designed to enhance the accuracy of artificial intelligence in the diagnosis and treatment of HICH.
arXiv Detail & Related papers (2024-01-29T07:44:09Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- ViMQ: A Vietnamese Medical Question Dataset for Healthcare Dialogue System Development [1.4315915057750197]
We publish a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations.
We propose a simple self-supervised training strategy with span-noise modelling that improves performance.
arXiv Detail & Related papers (2023-04-27T17:59:53Z)
- Suggestive Annotation of Brain Tumour Images with Gradient-guided Sampling [14.092503407739422]
We propose an efficient annotation framework for brain tumour images that is able to suggest informative sample images for human experts to annotate.
Experiments show that training a segmentation model with only 19% of suggestively annotated patient scans from the BraTS 2019 dataset achieves performance comparable to training on the full dataset for the whole-tumour segmentation task.
arXiv Detail & Related papers (2020-06-26T13:39:49Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.