Video CLIP Model for Multi-View Echocardiography Interpretation
- URL: http://arxiv.org/abs/2504.18800v1
- Date: Sat, 26 Apr 2025 05:11:15 GMT
- Title: Video CLIP Model for Multi-View Echocardiography Interpretation
- Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
- Abstract summary: We develop a video-language model that takes five different views and full video sequences as input, training it on pairs of echocardiographic videos and clinical reports. Our experiments demonstrate that this expanded approach achieves higher interpretation accuracy than models trained with only single-view videos or with still images.
- Score: 0.4336394330456971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Echocardiography involves recording videos of the heart using ultrasound, enabling clinicians to evaluate its condition. Recent advances in large-scale vision-language models (VLMs) have garnered attention for automating the interpretation of echocardiographic videos. However, most existing VLMs proposed for medical interpretation rely on single-frame (i.e., image) inputs. Consequently, these image-based models often exhibit lower diagnostic accuracy for conditions identifiable through cardiac motion. Moreover, echocardiographic videos are recorded from various views that depend on the direction of ultrasound emission, and certain views are more suitable than others for interpreting specific conditions. Incorporating multiple views could potentially yield further improvements in accuracy. In this study, we developed a video-language model that takes five different views and full video sequences as input, training it on pairs of echocardiographic videos and clinical reports from 60,747 cases. Our experiments demonstrate that this expanded approach achieves higher interpretation accuracy than models trained with only single-view videos or with still images.
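The abstract describes the recipe but not the implementation. The sketch below illustrates one plausible reading, assuming a CLIP-style setup: each of the five views' full video clips is encoded, the view embeddings are fused (here by mean pooling, an assumption), and the result is aligned with the report embedding through a symmetric contrastive loss. The `video_encoder` and `text_encoder` modules and all names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewVideoCLIP(nn.Module):
    """Hypothetical sketch: a shared video encoder is applied to each of the
    five echocardiographic views, the view embeddings are fused by mean
    pooling, and the result is matched against the report embedding with a
    CLIP-style symmetric contrastive loss."""

    def __init__(self, video_encoder: nn.Module, text_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.video_encoder = video_encoder  # maps (B, T, C, H, W) -> (B, dim)
        self.text_encoder = text_encoder    # maps tokenized reports -> (B, dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, views: list[torch.Tensor], report_tokens: torch.Tensor) -> torch.Tensor:
        # Encode each view's full video clip, then fuse across views.
        view_embs = torch.stack([self.video_encoder(v) for v in views])  # (5, B, dim)
        v = F.normalize(view_embs.mean(dim=0), dim=-1)                   # (B, dim)
        t = F.normalize(self.text_encoder(report_tokens), dim=-1)        # (B, dim)

        logits = self.logit_scale.exp() * v @ t.T                        # (B, B)
        labels = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: video-to-report and report-to-video.
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```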
Related papers
- EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance [79.66329903007869]
We present EchoWorld, a motion-aware world modeling framework for probe guidance. It encodes anatomical knowledge and motion-induced visual dynamics. It is trained on more than one million ultrasound images from over 200 routine scans.
arXiv Detail & Related papers (2025-04-17T16:19:05Z)
- MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer [6.520396145278936]
We introduce a visual query (VQ)-based video clip localization method to assist sonographers by enabling them to capture a quick US sweep.
MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies.
Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens.
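As a toy illustration of the localization task (not MCAT's multi-tier class-aware token transformer), one can score every frame of a sweep against the visual query embedding and return the best-scoring window; the sketch below assumes precomputed frame and query embeddings.

```python
import torch
import torch.nn.functional as F

def localize_clip(frame_embs: torch.Tensor, query_emb: torch.Tensor, window: int = 16):
    """Naive visual-query clip localization baseline: cosine-score each frame
    against the query embedding and return the highest-scoring contiguous
    window as (start, end). Shapes: frame_embs (T, D), query_emb (D,)."""
    sims = F.cosine_similarity(frame_embs, query_emb.unsqueeze(0), dim=-1)  # (T,)
    # Average score over every length-`window` span via 1-D pooling.
    pooled = F.avg_pool1d(sims[None, None, :], kernel_size=window, stride=1).squeeze()
    start = int(pooled.argmax())
    return start, start + window
```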
arXiv Detail & Related papers (2025-04-08T14:29:15Z)
- EchoFM: Foundation Model for Generalizable Echocardiogram Analysis [22.585990526913246]
We introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos, totaling up to 20 million frames.
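The blurb does not specify the objective; a common recipe for this kind of spatiotemporal self-supervision is MAE-style masked video modeling, sketched below under that assumption. The `encoder` and `decoder` interfaces are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_video_modeling_step(encoder: nn.Module, decoder: nn.Module,
                               tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style masked video modeling (an assumed objective, not necessarily
    EchoFM's): hide most space-time patches, encode the visible ones, and
    reconstruct the hidden ones. tokens: (batch, space-time patches, dim)."""
    B, N, D = tokens.shape
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)  # random shuffle
    visible = torch.gather(tokens, 1, idx[:, :keep, None].expand(-1, -1, D))
    latent = encoder(visible)        # encode only the visible patches
    pred = decoder(latent, idx)      # hypothetical decoder returning (B, N - keep, D)
    target = torch.gather(tokens, 1, idx[:, keep:, None].expand(-1, -1, D))
    return F.mse_loss(pred, target)  # reconstruct the masked patches
```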
arXiv Detail & Related papers (2024-10-30T19:32:02Z)
- EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation [1.0840985826142429]
We introduce EchoPrime, a multi-view, view-informed, video-based vision-language foundation model trained on over 12 million video-report pairs.
With retrieval-augmented interpretation, EchoPrime integrates information from all echocardiogram videos in a comprehensive study.
In datasets from two independent healthcare systems, EchoPrime achieves state-of-the-art performance on 23 diverse benchmarks of cardiac form and function.
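The retrieval-augmented step can be pictured as nearest-neighbor lookup in a shared embedding space; the sketch below pools a study's video embeddings and retrieves the most similar candidate report sentences. EchoPrime's view-informed weighting is deliberately omitted, and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_report_sentences(study_video_embs: torch.Tensor,
                              candidate_text_embs: torch.Tensor,
                              candidate_texts: list[str], k: int = 5) -> list[str]:
    """Simplified retrieval-augmented interpretation: pool all video embeddings
    of one study and return the k most similar candidate report sentences.
    Shapes: study_video_embs (N, D), candidate_text_embs (M, D)."""
    study = F.normalize(study_video_embs.mean(dim=0), dim=-1)   # (D,)
    cands = F.normalize(candidate_text_embs, dim=-1)            # (M, D)
    topk = (cands @ study).topk(min(k, cands.size(0))).indices
    return [candidate_texts[i] for i in topk.tolist()]
```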
arXiv Detail & Related papers (2024-10-13T03:04:22Z)
- Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection [91.97935118185]
We propose Video-to-Image knowledge distillation for the task of medical video lesion detection.
By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models.
V2I-DETR outperforms previous state-of-the-art methods by a large margin while matching the real-time inference speed (30 FPS) of image-based models.
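Video-to-image distillation generically means an image-based student mimicking a video-based teacher; the sketch below shows a standard feature-plus-soft-label distillation loss, leaving out the DETR-specific query matching that V2I-DETR actually uses.

```python
import torch
import torch.nn.functional as F

def v2i_distill_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor,
                     student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                     tau: float = 2.0) -> torch.Tensor:
    """Generic video-to-image distillation: the single-frame student matches
    the features and softened predictions of a multi-frame teacher."""
    feat_loss = F.mse_loss(student_feats, teacher_feats.detach())
    kd_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                       F.softmax(teacher_logits.detach() / tau, dim=-1),
                       reduction="batchmean") * tau * tau
    return feat_loss + kd_loss
```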
arXiv Detail & Related papers (2024-08-26T07:17:05Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- Echocardiography video synthesis from end diastolic semantic map via diffusion model [0.0]
This paper aims to tackle the challenges by expanding upon existing video diffusion models for the purpose of cardiac video synthesis.
Our focus lies in generating video using semantic maps of the initial frame during the cardiac cycle, commonly referred to as end diastole.
Our model outperforms the standard diffusion technique on multiple metrics, including FID, FVD, and SSIM.
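One straightforward way to condition a video diffusion model on an end-diastole semantic map is to concatenate the map to every noisy frame along the channel axis before denoising; the sketch below assumes that scheme and a user-supplied `unet`, which may differ from the paper's exact conditioning.

```python
import torch

def conditional_denoise_step(unet, x_t: torch.Tensor, t: torch.Tensor,
                             semantic_map: torch.Tensor) -> torch.Tensor:
    """Channel-concatenation conditioning (an assumed scheme): broadcast the
    end-diastole semantic map across time and feed it to the denoiser together
    with the noisy video to predict the added noise."""
    T = x_t.shape[1]                                            # x_t: (B, T, C, H, W)
    cond = semantic_map.unsqueeze(1).expand(-1, T, -1, -1, -1)  # (B, T, C_s, H, W)
    return unet(torch.cat([x_t, cond], dim=2), t)               # noise prediction
```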
arXiv Detail & Related papers (2023-10-11T02:08:05Z)
- GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis [14.737295160286939]
Vision-based machine learning (ML) methods have gained popularity as secondary layers of verification.
We propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that provides explainability.
We show the flexibility of our framework by considering two critical tasks: ejection fraction (EF) estimation and aortic stenosis (AS) severity detection.
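A multi-level design can be pictured as stacked transformers, one summarizing frames within each clip and one aggregating clips within a study; the sketch below is that generic hierarchy with assumed dimensions and pooling, not GEMTrans itself.

```python
import torch
import torch.nn as nn

class HierarchicalEchoTransformer(nn.Module):
    """Illustrative multi-level design (details are assumptions): a frame-level
    transformer summarizes each echo clip, and a video-level transformer
    aggregates clip summaries for a study-level head such as EF regression or
    AS severity classification."""

    def __init__(self, dim: int = 256, n_out: int = 1):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.frame_level = nn.TransformerEncoder(layer(), num_layers=2)
        self.video_level = nn.TransformerEncoder(layer(), num_layers=2)
        self.head = nn.Linear(dim, n_out)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (videos, frames, dim) for one study
        clip_summaries = self.frame_level(frame_feats).mean(dim=1)          # (videos, dim)
        study = self.video_level(clip_summaries.unsqueeze(0)).mean(dim=1)   # (1, dim)
        return self.head(study)  # e.g. EF estimate or AS severity logits
```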
arXiv Detail & Related papers (2023-08-25T07:30:18Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks [48.732863591145964]
We propose a multi-modal convolutional neural network architecture that labels endoscopic ultrasound (EUS) images from raw verbal comments provided by a clinician during the procedure.
Our results show a prediction accuracy of 76% at image level on a dataset with 5 different labels.
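A minimal reading of this setup is a two-stream classifier that fuses an image embedding with a speech embedding before predicting one of the 5 labels; the encoders and late fusion below are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ImageSpeechFusion(nn.Module):
    """Hypothetical late-fusion classifier: independent image and speech
    encoders, concatenated and classified into one of the procedure labels."""

    def __init__(self, img_encoder: nn.Module, speech_encoder: nn.Module,
                 img_dim: int, speech_dim: int, n_labels: int = 5):
        super().__init__()
        self.img_encoder = img_encoder
        self.speech_encoder = speech_encoder
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + speech_dim, 256), nn.ReLU(), nn.Linear(256, n_labels))

    def forward(self, image: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.img_encoder(image), self.speech_encoder(speech)], dim=-1)
        return self.classifier(fused)  # label logits per image
```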
arXiv Detail & Related papers (2021-10-12T21:22:24Z)
- Neural collaborative filtering for unsupervised mitral valve segmentation in echocardiography [60.08918310097638]
We propose an automated and unsupervised method for the mitral valve segmentation based on a low dimensional embedding of the echocardiography videos.
The method is evaluated in a collection of echocardiography videos of patients with a variety of mitral valve diseases and on an independent test cohort.
It outperforms state-of-the-art unsupervised and supervised methods on low-quality videos or in the case of sparse annotation.
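The collaborative-filtering idea can be sketched as learning low-dimensional pixel and frame embeddings whose interaction reconstructs intensities, so that poorly reconstructed, fast-moving pixels (the valve) stand out; the details below are assumptions in the spirit of neural collaborative filtering, not the paper's exact model.

```python
import torch
import torch.nn as nn

class NCFVideoEmbedding(nn.Module):
    """Assumed sketch: learn a low-dimensional embedding per pixel and per
    frame, reconstruct intensities from their interaction, and treat poorly
    reconstructed (fast-moving) pixels, such as the mitral valve, as foreground."""

    def __init__(self, n_pixels: int, n_frames: int, dim: int = 8):
        super().__init__()
        self.pixel_emb = nn.Embedding(n_pixels, dim)
        self.frame_emb = nn.Embedding(n_frames, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, pixel_idx: torch.Tensor, frame_idx: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.pixel_emb(pixel_idx), self.frame_emb(frame_idx)], dim=-1)
        return self.mlp(z).squeeze(-1)  # predicted intensity per (pixel, frame)

# Training minimizes MSE against observed intensities; the per-pixel residual
# over time can then be thresholded to obtain a valve mask.
```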
arXiv Detail & Related papers (2020-08-13T12:53:26Z)