EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation
- URL: http://arxiv.org/abs/2410.09704v1
- Date: Sun, 13 Oct 2024 03:04:22 GMT
- Title: EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation
- Authors: Milos Vukadinovic, Xiu Tang, Neal Yuan, Paul Cheng, Debiao Li, Susan Cheng, Bryan He, David Ouyang
- Abstract summary: We introduce EchoPrime, a multi-view, view-informed, video-based vision-language foundation model trained on over 12 million video-report pairs.
With retrieval-augmented interpretation, EchoPrime integrates information from all echocardiogram videos in a comprehensive study.
In datasets from two independent healthcare systems, EchoPrime achieves state-of-the-art performance on 23 diverse benchmarks of cardiac form and function.
- Score: 1.0840985826142429
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Echocardiography is the most widely used cardiac imaging modality, capturing ultrasound video data to assess cardiac structure and function. Artificial intelligence (AI) in echocardiography has the potential to streamline manual tasks and improve reproducibility and precision. However, most echocardiography AI models are single-view, single-task systems that do not synthesize complementary information from multiple views captured during a full exam, and thus have limited performance and scope of application. To address this problem, we introduce EchoPrime, a multi-view, view-informed, video-based vision-language foundation model trained on over 12 million video-report pairs. EchoPrime uses contrastive learning to train a unified embedding model for all standard views in a comprehensive echocardiogram study, with representation of both rare and common diseases and diagnoses. EchoPrime then uses view classification and a view-informed anatomic attention model to weight video-specific interpretations, accurately mapping the relationship between echocardiographic views and anatomical structures. With retrieval-augmented interpretation, EchoPrime integrates information from all echocardiogram videos in a comprehensive study and performs holistic, comprehensive clinical echocardiography interpretation. In datasets from two independent healthcare systems, EchoPrime achieves state-of-the-art performance on 23 diverse benchmarks of cardiac form and function, surpassing the performance of both task-specific approaches and prior foundation models. Following rigorous clinical evaluation, EchoPrime can assist physicians in the automated preliminary assessment of comprehensive echocardiography.
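The abstract outlines three computational steps: a contrastive video-report embedding, view-informed anatomic attention over the videos in a study, and retrieval-augmented interpretation. The PyTorch sketch below shows one hypothetical way the aggregation and retrieval steps could be wired together at inference time; every function name, tensor shape, and weighting choice here is an illustrative assumption, not the released EchoPrime implementation.

```python
# Hypothetical sketch: view-informed aggregation + retrieval-augmented interpretation.
# Shapes, names, and the attention scheme are illustrative assumptions only.
import torch
import torch.nn.functional as F

def interpret_study(video_embs, view_logits, anatomy_weights, report_bank, report_texts, top_k=3):
    """
    video_embs:      (N, D)  embeddings of the N videos in one study (from a video encoder)
    view_logits:     (N, V)  per-video view-classification logits over V standard views
    anatomy_weights: (V,)    assumed relevance of each view for the anatomical structure of interest
    report_bank:     (M, D)  embeddings of M candidate report sections (from a text encoder)
    report_texts:    list of M report-section strings
    """
    view_probs = view_logits.softmax(dim=-1)                          # (N, V)
    # Weight each video by how relevant its predicted view is to the target anatomy.
    video_weights = (view_probs * anatomy_weights).sum(-1)            # (N,)
    video_weights = video_weights / video_weights.sum().clamp(min=1e-8)
    study_emb = (video_weights.unsqueeze(-1) * video_embs).sum(0)     # (D,)

    # Retrieval-augmented interpretation: nearest report sections by cosine similarity.
    sims = F.cosine_similarity(study_emb.unsqueeze(0), report_bank)   # (M,)
    idx = sims.topk(top_k).indices
    return [report_texts[i] for i in idx.tolist()]

# Toy usage with random tensors standing in for real encoder outputs.
N, V, D, M = 8, 11, 512, 100
candidates = interpret_study(
    torch.randn(N, D), torch.randn(N, V), torch.rand(V),
    torch.randn(M, D), [f"report section {i}" for i in range(M)],
)
print(candidates)
```

In a real pipeline the retrieved report sections would then be assembled into the study-level interpretation; here they are simply printed.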
Related papers
- EchoFM: Foundation Model for Generalizable Echocardiogram Analysis [22.585990526913246]
We introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos.
In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability.
We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos with up to 20 million frames.
arXiv Detail & Related papers (2024-10-30T19:32:02Z)
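The EchoFM summary above mentions a self-supervised framework capturing spatial and temporal variability but does not spell out the objective. Purely as a point of reference, the sketch below shows a generic masked tube-reconstruction objective of the kind commonly used for video pretraining; the module names, patch sizes, and masking ratio are assumptions and are not taken from EchoFM.

```python
# Generic masked spatio-temporal pretraining objective (illustrative only; not EchoFM's code).
import torch
import torch.nn as nn

class TinyVideoMAE(nn.Module):
    def __init__(self, patch=8, dim=128):
        super().__init__()
        self.patch = patch
        # 3D "tube" embedding over (time, height, width), then a small transformer encoder.
        self.embed = nn.Conv3d(1, dim, kernel_size=(2, patch, patch), stride=(2, patch, patch))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.decode = nn.Linear(dim, 2 * patch * patch)  # reconstruct the raw pixels of each tube

    def tube_targets(self, video):
        # Unfold the clip into the same (2 x patch x patch) tubes used by the embedding layer.
        p = self.patch
        tubes = video.unfold(2, 2, 2).unfold(3, p, p).unfold(4, p, p)
        return tubes.reshape(video.shape[0], -1, 2 * p * p)

    def forward(self, video, mask_ratio=0.75):
        # video: (B, 1, T, H, W) grayscale echo clip
        tokens = self.embed(video).flatten(2).transpose(1, 2)                   # (B, L, dim)
        mask = torch.rand(tokens.shape[:2], device=video.device) < mask_ratio   # True = hidden
        encoded = self.encoder(tokens * (~mask).unsqueeze(-1))                  # zero out masked tokens
        recon = self.decode(encoded)                                            # (B, L, 2*patch*patch)
        # Reconstruction loss only on masked positions, MAE-style.
        return ((recon - self.tube_targets(video)) ** 2).mean(-1)[mask].mean()

model = TinyVideoMAE()
loss = model(torch.randn(2, 1, 16, 64, 64))  # toy batch: two 16-frame, 64x64 clips
loss.backward()
```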
- EchoApex: A General-Purpose Vision Foundation Model for Echocardiography [9.202542805578432]
We introduce EchoApex, the first general-purpose vision foundation model for echocardiography, with applications across a variety of clinical tasks.
Leveraging self-supervised learning, EchoApex is pretrained on over 20 million echo images from 11 clinical centres.
Compared to state-of-the-art task-specific models, EchoApex attains improved performance with a unified image encoding architecture.
arXiv Detail & Related papers (2024-10-14T21:10:56Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
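The ViKL summary above centers on aligning visual, knowledge, and linguistic features during unsupervised pretraining. The sketch below shows a generic pairwise contrastive (InfoNCE) alignment across three embedding spaces as one plausible reading; the loss form and hyperparameters are assumptions, not ViKL's published objective.

```python
# Hypothetical tri-modal contrastive alignment of visual, knowledge, and linguistic features.
# A generic pairwise InfoNCE over three embedding spaces, not ViKL's actual loss.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of embeddings (B, D); matched pairs share an index.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(visual, knowledge, linguistic):
    # Align every pair of modalities so all three map into a shared space.
    return (info_nce(visual, knowledge)
            + info_nce(visual, linguistic)
            + info_nce(knowledge, linguistic)) / 3.0

B, D = 16, 256
loss = trimodal_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(float(loss))  # in training this would be backpropagated into the three encoders
```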
- Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World Model [66.35766658717205]
There is a severe shortage of experienced cardiac sonographers due to the heart's complex structure and the significant operational challenges of echocardiography.
We present a Cardiac Copilot system capable of providing real-time probe movement guidance.
The core innovation lies in proposing a data-driven world model, named Cardiac Dreamer, for representing cardiac spatial structures.
We train our model on real-world ultrasound data and corresponding probe motion from 110 routine clinical scans performed by three certified sonographers, totaling 151K sample pairs.
arXiv Detail & Related papers (2024-06-19T02:42:29Z)
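The Cardiac Copilot summary above describes a data-driven world model (Cardiac Dreamer) relating probe motion to cardiac spatial structure. As a loose illustration of that general idea only, the sketch below pairs a latent dynamics model with a scoring loop over candidate probe motions; the architecture, dimensions, and ranking heuristic are assumptions, not the paper's method.

```python
# Illustrative latent world model for probe guidance (assumed structure, not Cardiac Dreamer).
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predicts the next latent cardiac-view state from the current state and a probe action."""
    def __init__(self, state_dim=256, action_dim=6, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def rank_candidate_motions(model, state, candidates, target_state):
    # Score candidate probe motions by how close the predicted next state is to a target view.
    with torch.no_grad():
        preds = model(state.expand(len(candidates), -1), candidates)  # (K, state_dim)
        dists = (preds - target_state).norm(dim=-1)                   # (K,)
    return dists.argsort()  # most promising motion first

model = LatentDynamics()
state, target = torch.randn(256), torch.randn(256)
candidates = torch.randn(8, 6)  # eight hypothetical 6-DoF probe adjustments
print(rank_candidate_motions(model, state, candidates, target))
```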
- Automatic Cardiac Pathology Recognition in Echocardiography Images Using Higher Order Dynamic Mode Decomposition and a Vision Transformer for Small Datasets [2.0286377328378737]
Heart disease is the leading cause of death worldwide; according to the WHO, nearly 18 million people die each year from cardiovascular disease.
In this work, an automatic cardiac pathology recognition system based on a novel deep learning framework is proposed.
arXiv Detail & Related papers (2024-04-30T14:16:45Z)
- CathFlow: Self-Supervised Segmentation of Catheters in Interventional Ultrasound Using Optical Flow and Transformers [66.15847237150909]
We introduce a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images.
The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism.
We validated our model on a test dataset consisting of unseen synthetic data and images collected from silicone aorta phantoms.
arXiv Detail & Related papers (2024-03-21T15:13:36Z)
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- Multimodal Foundation Models For Echocardiogram Interpretation [0.24578723416255746]
We leverage 1,032,975 cardiac ultrasound videos and corresponding expert interpretations to develop EchoCLIP.
EchoCLIP displays strong zero-shot (not explicitly trained) performance in cardiac function assessment.
We also developed a long-context variant (EchoCLIP-R) with a custom echocardiography report text tokenizer.
arXiv Detail & Related papers (2023-08-29T23:45:54Z)
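The EchoCLIP summary above reports zero-shot performance, which implies scoring candidate text descriptions against an echo embedding in a shared space. The sketch below shows that generic CLIP-style scoring step with placeholder encoders and random weights; it is not the released EchoCLIP model, and the prompts are illustrative.

```python
# Generic CLIP-style zero-shot scoring, in the spirit of the zero-shot evaluation described
# for EchoCLIP. The encoder and text embeddings below are random placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def zero_shot_scores(image_emb, text_embs):
    # Cosine similarity between one image embedding and a bank of candidate text prompts.
    return (image_emb @ text_embs.t()).softmax(dim=-1)

prompts = ["severely reduced ejection fraction", "normal left ventricular function"]
image_encoder = ToyImageEncoder()
image_emb = image_encoder(torch.randn(1, 1, 112, 112))           # placeholder echo frame
text_embs = F.normalize(torch.randn(len(prompts), 512), dim=-1)  # placeholder text embeddings
probs = zero_shot_scores(image_emb, text_embs)
print(dict(zip(prompts, probs.squeeze(0).tolist())))
```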
- GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis [14.737295160286939]
Vision-based machine learning (ML) methods have gained popularity as secondary layers of verification in cardiovascular diagnosis.
We propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that provides explainability.
We show the flexibility of our framework on two critical tasks: ejection fraction (EF) estimation and aortic stenosis (AS) severity detection.
arXiv Detail & Related papers (2023-08-25T07:30:18Z)
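A multi-level design of the kind the GEMTrans summary names is often organized as attention within a video followed by attention across the videos of a study. The sketch below is one hypothetical arrangement of that hierarchy; depths, dimensions, pooling, and the classification head are assumptions rather than GEMTrans's architecture.

```python
# Rough sketch of a multi-level (frame -> video -> study) transformer hierarchy.
# Depths, dims, and pooling are assumptions, not the GEMTrans implementation.
import torch
import torch.nn as nn

def encoder(dim, layers=2, heads=4):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True), num_layers=layers)

class MultiLevelEcho(nn.Module):
    def __init__(self, dim=128, num_classes=4):
        super().__init__()
        self.frame_level = encoder(dim)           # attends over frames within a video
        self.study_level = encoder(dim)           # attends over videos within a study
        self.head = nn.Linear(dim, num_classes)   # e.g. AS severity classes or binned EF

    def forward(self, frame_tokens):
        # frame_tokens: (num_videos, num_frames, dim) pre-extracted frame embeddings for one study
        video_embs = self.frame_level(frame_tokens).mean(dim=1)   # (num_videos, dim)
        study_tokens = self.study_level(video_embs.unsqueeze(0))  # (1, num_videos, dim)
        return self.head(study_tokens.mean(dim=1))                # (1, num_classes)

model = MultiLevelEcho()
logits = model(torch.randn(6, 32, 128))  # a toy study: 6 videos x 32 frames each
print(logits.shape)
```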
- Improving Radiology Summarization with Radiograph and Anatomy Prompts [60.30659124918211]
We propose a novel anatomy-enhanced multimodal model to promote impression generation.
In detail, we first construct a set of rules to extract anatomies and put these prompts into each sentence to highlight anatomy characteristics.
We utilize a contrastive learning module to align the two representations at the overall level and a co-attention mechanism to fuse them at the sentence level.
arXiv Detail & Related papers (2022-10-15T14:05:03Z)
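The radiology-summarization summary above names two fusion mechanisms: contrastive alignment of the two modalities at the overall level and co-attention at the sentence level. The sketch below illustrates both in generic form; the co-attention layout, dimensions, and pooling are assumptions, not the paper's implementation.

```python
# Hedged sketch of the two fusion steps described above: a global contrastive alignment
# between image and text representations, plus sentence-level co-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, sent_tokens, img_tokens):
        # sent_tokens: (B, S, dim) sentence-level text features; img_tokens: (B, R, dim) image regions
        t2i, _ = self.text_to_image(sent_tokens, img_tokens, img_tokens)   # text queries attend to image
        i2t, _ = self.image_to_text(img_tokens, sent_tokens, sent_tokens)  # image queries attend to text
        fused_text = self.merge(torch.cat([sent_tokens, t2i], dim=-1))     # (B, S, dim)
        return fused_text, i2t

def global_alignment_loss(img_global, txt_global, temperature=0.07):
    # Overall-level contrastive alignment between pooled image and report embeddings.
    img, txt = F.normalize(img_global, dim=-1), F.normalize(txt_global, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.shape[0])
    return F.cross_entropy(logits, targets)

fusion = CoAttentionFusion()
fused_text, _ = fusion(torch.randn(2, 5, 256), torch.randn(2, 49, 256))
loss = global_alignment_loss(torch.randn(2, 256), torch.randn(2, 256))
print(fused_text.shape, float(loss))
```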
- Factored Attention and Embedding for Unstructured-view Topic-related Ultrasound Report Generation [70.7778938191405]
We propose a novel factored attention and embedding model (termed FAE-Gen) for unstructured-view, topic-related ultrasound report generation.
The proposed FAE-Gen mainly consists of two modules, i.e., view-guided factored attention and topic-oriented factored embedding, which capture the homogeneous and heterogeneous morphological characteristics across different views.
arXiv Detail & Related papers (2022-03-12T15:24:03Z)