Related papers: MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

URL: http://arxiv.org/abs/2504.06088v1
Date: Tue, 08 Apr 2025 14:29:15 GMT
Title: MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer
Authors: Divyanshu Mishra, Pramit Saha, He Zhao, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris Papageorghiou, J. Alison Noble
Abstract summary: We introduce a visual query-based video clip localization (VQ-VCL) method to assist sonographers by enabling them to capture a quick US sweep. MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens.
Score: 6.520396145278936
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening, diagnosis and allowing sonographers to examine more patients.

Related papers

Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos [22.437678884189697]
This study proposes a novel video classification method based on CNN and LSTM. It reduces CNN-extracted image features to 1x512 dimension, followed by sorting and compressing feature vectors for LSTM training. Experimental results demonstrate that our variable-frame CNNLSTM method outperforms other approaches across all metrics.
arXiv Detail & Related papers (2025-02-17T06:35:37Z)
A Multimodal Approach For Endoscopic VCE Image Classification Using BiomedCLIP-PubMedBERT [0.62914438169038]
This paper presents an advanced approach for fine-tuning BiomedCLIP PubMedBERT, a multimodal model, to classify abnormalities in Video Capsule Endoscopy frames. Our method categorizes images into ten specific classes: angioectasia, bleeding, erosion, erythema, foreign body, lymphangiectasia, polyp, ulcer, worms, and normal. Performance metrics, including classification accuracy, recall, and F1 score, indicate the model's strong ability to accurately identify abnormalities in endoscopic frames.
arXiv Detail & Related papers (2024-10-25T19:42:57Z)
Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis [9.530028450239394]
The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data. Pre-trained audio encoders are utilized to encode the patient voice to get the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks.
arXiv Detail & Related papers (2024-09-05T14:56:38Z)
MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video [13.231546105751015]
We present the first automated multimodal summary generation system, MMSummary, for medical imaging video, with a particular focus on fetal ultrasound analysis. MMSummary is designed as a three-stage pipeline, progressing from anatomy detection to captioning and finally segmentation and measurement. Based on reported experiments, MMSummary is estimated to reduce scanning time by approximately 31.5%, suggesting the potential to enhance workflow efficiency.
arXiv Detail & Related papers (2024-08-07T13:30:58Z)
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
Automated interpretation of congenital heart disease from multi-view echocardiograms [10.238433789459624]
Congenital heart disease (CHD) is the most common birth defect and the leading cause of neonate death in China. This study proposes to automatically analyze the multi-view echocardiograms with a practical end-to-end framework.
arXiv Detail & Related papers (2023-11-30T18:37:21Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images. ViTs rely on patch-based self-attention rather than convolutions and, in contrast to CNNs, carry no prior knowledge of local connectivity. Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
Preservation of High Frequency Content for Deep Learning-Based Medical Image Classification [74.84221280249876]
An efficient analysis of large amounts of chest radiographs can aid physicians and radiologists. We propose a novel Discrete Wavelet Transform (DWT)-based method for the efficient identification and encoding of visual information.
arXiv Detail & Related papers (2022-05-08T15:29:54Z)
Statistical Dependency Guided Contrastive Learning for Multiple Labeling in Prenatal Ultrasound [56.631021151764955]
Standard plane recognition plays an important role in prenatal ultrasound (US) screening. We build a novel multi-label learning scheme to identify multiple standard planes and corresponding anatomical structures simultaneously.
arXiv Detail & Related papers (2021-08-11T06:39:26Z)
