EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography
- URL: http://arxiv.org/abs/2512.12107v1
- Date: Sat, 13 Dec 2025 00:48:31 GMT
- Title: EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography
- Authors: Yuheng Li, Yue Zhang, Abdoul Aziz Amadou, Yuxiang Lai, Jike Zhong, Tiziano Passerini, Dorin Comaniciu, Puneet Sharma,
- Abstract summary: Vision-language models (VLMs) have achieved broad success in natural images and certain medical domains. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset. We propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives.
- Score: 19.10644729648278
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
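The abstract describes the two pretraining objectives only at a high level. As an illustrative sketch of the general idea (the actual formulation, weighting scheme, and hyperparameters are assumptions, not taken from the paper), a view-informed contrastive loss might down-weight in-batch negatives that share the anchor's echocardiographic view, since same-view pairs are visually similar and should not be pushed apart as aggressively:

```python
import torch
import torch.nn.functional as F

def view_informed_contrastive_loss(img_emb, txt_emb, view_ids,
                                   temperature=0.07, same_view_weight=0.5):
    """Hypothetical CLIP-style InfoNCE in which negatives sharing the
    anchor's echo view are down-weighted. The weighting scheme here is
    illustrative only, not the paper's actual objective."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    same_view = view_ids.unsqueeze(0) == view_ids.unsqueeze(1)  # (B, B) mask
    # Down-weight same-view negatives; the matched (diagonal) pairs stay at 1.
    weights = torch.where(same_view,
                          torch.full_like(logits, same_view_weight),
                          torch.ones_like(logits))
    weights.fill_diagonal_(1.0)
    # Adding log-weights to logits scales each term inside the softmax.
    weighted_logits = logits + weights.log()
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(weighted_logits, targets)
                  + F.cross_entropy(weighted_logits.t(), targets))
```

A negation-aware variant could follow the same pattern, e.g. by treating captions that differ only in polarity ("no regurgitation" vs. "regurgitation") as hard negatives rather than near-duplicates.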
Related papers
- Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting [8.162197738994479]
We introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models.
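The ReAct-style orchestration described here follows a standard thought / action / observation loop. A minimal, framework-free sketch of that pattern (the planner interface and tool names are hypothetical, not Echo-CoPilot's actual API):

```python
# Minimal ReAct-style tool loop. In a real agent, plan_step would be an
# LLM call that reads the query and the accumulated history; here it is
# an abstract callable so the control flow is visible.

def react_loop(query, plan_step, tools, max_steps=5):
    """plan_step(query, history) -> (thought, action_name, args),
    or (thought, "finish", final_answer) to terminate."""
    history = []
    for _ in range(max_steps):
        thought, action, args = plan_step(query, history)
        if action == "finish":
            return args  # final answer
        observation = tools[action](**args)  # invoke the chosen tool
        history.append((thought, action, observation))
    return None  # step budget exhausted
```

A usage example with a single hypothetical tool: a planner that first calls `view_recognition`, then finishes with the observed view label.

```python
tools = {"view_recognition": lambda clip: "A4C"}

def plan(query, history):
    if not history:
        return ("need the view", "view_recognition", {"clip": "clip_0"})
    return ("done", "finish", history[-1][2])
```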
arXiv Detail & Related papers (2025-12-06T23:27:54Z) - Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation [83.02147613524032]
We introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. We propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations. FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions.
arXiv Detail & Related papers (2025-10-14T19:57:03Z) - TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z) - EchoApex: A General-Purpose Vision Foundation Model for Echocardiography [9.202542805578432]
We introduce EchoApex, the first general-purpose vision foundation model for echocardiography, with applications across a variety of clinical practices.
Leveraging self-supervised learning, EchoApex is pretrained on over 20 million echo images from 11 clinical centres.
Compared to state-of-the-art task-specific models, EchoApex attains improved performance with a unified image encoding architecture.
arXiv Detail & Related papers (2024-10-14T21:10:56Z) - Multi-scale, Data-driven and Anatomically Constrained Deep Learning Image Registration for Adult and Fetal Echocardiography [4.923733944174007]
We propose a framework that combines three strategies for deep learning image registration in both fetal and adult echo.
Our tests show that good anatomical topology and image textures are strongly linked to shape-encoded and data-driven adversarial losses.
Our approach outperforms traditional non-DL gold standard registration approaches, including Optical Flow and Elastix.
arXiv Detail & Related papers (2023-09-02T05:33:31Z) - Multimodal Foundation Models For Echocardiogram Interpretation [0.24578723416255746]
We leverage 1,032,975 cardiac ultrasound videos and corresponding expert interpretations to develop EchoCLIP.
EchoCLIP displays strong zero-shot (not explicitly trained) performance in cardiac function assessment.
We also developed a long-context variant (EchoCLIP-R) with a custom echocardiography report text tokenizer.
arXiv Detail & Related papers (2023-08-29T23:45:54Z) - GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis [14.737295160286939]
Vision-based machine learning (ML) methods have gained popularity to act as secondary layers of verification.
We propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that provides explainability.
We show the flexibility of our framework by considering two critical tasks including ejection fraction (EF) and aortic stenosis (AS) severity detection.
arXiv Detail & Related papers (2023-08-25T07:30:18Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - MyoPS: A Benchmark of Myocardial Pathology Segmentation Combining Three-Sequence Cardiac Magnetic Resonance Images [84.02849948202116]
This work defines a new medical image analysis task: myocardial pathology segmentation (MyoPS) from three-sequence cardiac magnetic resonance (CMR) images. The task was first proposed in the MyoPS challenge, held in conjunction with MICCAI 2020.
The challenge provided 45 paired and pre-aligned CMR images, allowing algorithms to combine the complementary information from the three CMR sequences for pathology segmentation.
arXiv Detail & Related papers (2022-01-10T06:37:23Z) - Semantic segmentation of multispectral photoacoustic images using deep learning [53.65837038435433]
Photoacoustic imaging has the potential to revolutionise healthcare.
Clinical translation of the technology requires conversion of the high-dimensional acquired data into clinically relevant and interpretable information.
We present a deep learning-based approach to semantic segmentation of multispectral photoacoustic images.
arXiv Detail & Related papers (2021-05-20T09:33:55Z) - Malignancy Prediction and Lesion Identification from Clinical Dermatological Images [65.1629311281062]
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images.
The method first identifies all lesions present in the image regardless of sub-type or likelihood of malignancy, then estimates each lesion's likelihood of malignancy, and through aggregation also generates an image-level likelihood of malignancy.
arXiv Detail & Related papers (2021-04-02T20:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.