MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer
- URL: http://arxiv.org/abs/2504.06088v1
- Date: Tue, 08 Apr 2025 14:29:15 GMT
- Title: MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer
- Authors: Divyanshu Mishra, Pramit Saha, He Zhao, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris Papageorghiou, J. Alison Noble
- Abstract summary: We introduce a visual query-based video clip localization (VQ-VCL) method to assist sonographers by enabling them to capture a quick US sweep. Given a visual query of the anatomy of interest, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens.
- Score: 6.520396145278936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening, diagnosis and allowing sonographers to examine more patients.
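The mIoU figures quoted in the abstract measure how well a predicted clip overlaps the annotated clip. As a rough sketch only, assuming clips are given as inclusive (start, end) frame indices and that mIoU is the plain average of per-video temporal IoU (the paper's exact evaluation protocol is not stated on this page), the metric could be computed as below. The names `temporal_iou` and `mean_iou` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# Assumption for illustration: a clip is an inclusive (start_frame, end_frame) pair.
Clip = Tuple[int, int]


def temporal_iou(pred: Clip, truth: Clip) -> float:
    """Intersection over union of two frame intervals with inclusive ends."""
    inter_start = max(pred[0], truth[0])
    inter_end = min(pred[1], truth[1])
    # Number of shared frames; zero when the intervals do not overlap.
    inter = max(0, inter_end - inter_start + 1)
    pred_len = pred[1] - pred[0] + 1
    truth_len = truth[1] - truth[0] + 1
    union = pred_len + truth_len - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Clip], truths: List[Clip]) -> float:
    """Average temporal IoU over paired predictions and annotations."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)
```

For example, a predicted clip covering frames 10 to 19 against an annotated clip covering frames 15 to 24 shares 5 frames out of a 15-frame union, giving an IoU of one third.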
Related papers
- Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos [58.71502465551297]
The Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry. The challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals.
arXiv Detail & Related papers (2026-02-13T13:28:22Z)
- FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound [2.8097961263689406]
The demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. We present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate Vision-Language Models (VLMs). Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis.
arXiv Detail & Related papers (2025-12-25T04:54:37Z)
- Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation [83.02147613524032]
We introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. We propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations. FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions.
arXiv Detail & Related papers (2025-10-14T19:57:03Z)
- TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z)
- Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos [22.437678884189697]
This study proposes a novel video classification method based on CNN and LSTM.
It reduces CNN-extracted image features to 1x512 dimension, followed by sorting and compressing feature vectors for LSTM training.
Experimental results demonstrate that our variable-frame CNNLSTM method outperforms other approaches across all metrics.
arXiv Detail & Related papers (2025-02-17T06:35:37Z)
- A Multimodal Approach For Endoscopic VCE Image Classification Using BiomedCLIP-PubMedBERT [0.62914438169038]
This paper presents an advanced approach for fine-tuning BiomedCLIP PubMedBERT, a multimodal model, to classify abnormalities in Video Capsule Endoscopy frames. Our method categorizes images into ten specific classes: angioectasia, bleeding, erosion, erythema, foreign body, lymphangiectasia, polyp, ulcer, worms, and normal. Performance metrics, including classification accuracy, recall, and F1 score, indicate the model's strong ability to accurately identify abnormalities in endoscopic frames.
arXiv Detail & Related papers (2024-10-25T19:42:57Z)
- Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis [9.530028450239394]
The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data.
Pre-trained audio encoders are utilized to encode the patient voice to get the audio features.
Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks.
arXiv Detail & Related papers (2024-09-05T14:56:38Z)
- MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video [13.231546105751015]
We present the first automated multimodal summary generation system, MMSummary, for medical imaging video, with a particular focus on fetal ultrasound analysis.
MMSummary is designed as a three-stage pipeline, progressing from anatomy detection to captioning and finally segmentation and measurement.
Based on reported experiments, it is estimated to reduce scanning time by approximately 31.5%, suggesting the potential to enhance workflow efficiency.
arXiv Detail & Related papers (2024-08-07T13:30:58Z)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection.
Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels.
Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
- Automated interpretation of congenital heart disease from multi-view echocardiograms [10.238433789459624]
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China.
This study proposes to automatically analyze the multi-view echocardiograms with a practical end-to-end framework.
arXiv Detail & Related papers (2023-11-30T18:37:21Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs rely on patch-based self-attention rather than convolutions and, in contrast to CNNs, have no built-in prior knowledge of local connectivity.
Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
- Preservation of High Frequency Content for Deep Learning-Based Medical Image Classification [74.84221280249876]
An efficient analysis of large amounts of chest radiographs can aid physicians and radiologists.
We propose a novel Discrete Wavelet Transform (DWT)-based method for the efficient identification and encoding of visual information.
arXiv Detail & Related papers (2022-05-08T15:29:54Z)
- Statistical Dependency Guided Contrastive Learning for Multiple Labeling in Prenatal Ultrasound [56.631021151764955]
Standard plane recognition plays an important role in prenatal ultrasound (US) screening.
We build a novel multi-label learning scheme to identify multiple standard planes and corresponding anatomical structures simultaneously.
arXiv Detail & Related papers (2021-08-11T06:39:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.