MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video
- URL: http://arxiv.org/abs/2408.03761v2
- Date: Wed, 30 Oct 2024 12:08:08 GMT
- Title: MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video
- Authors: Xiaoqing Guo, Qianhui Men, J. Alison Noble
- Abstract summary: We present MMSummary, the first automated multimodal summary generation system for medical imaging video, with a particular focus on fetal ultrasound analysis.
MMSummary is designed as a three-stage pipeline, progressing from keyframe detection to keyframe captioning and finally anatomy segmentation and measurement.
Based on reported experiments, the system is estimated to reduce scanning time by approximately 31.5%, suggesting the potential to enhance clinical workflow efficiency.
- Score: 13.231546105751015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first automated multimodal summary generation system, MMSummary, for medical imaging video, particularly with a focus on fetal ultrasound analysis. Imitating the examination process performed by a human sonographer, MMSummary is designed as a three-stage pipeline, progressing from keyframe detection to keyframe captioning and finally anatomy segmentation and measurement. In the keyframe detection stage, an innovative automated workflow is proposed to progressively select a concise set of keyframes, preserving sufficient video information without redundancy. Subsequently, we adapt a large language model to generate meaningful captions for fetal ultrasound keyframes in the keyframe captioning stage. If a keyframe is captioned as fetal biometry, the segmentation and measurement stage estimates biometric parameters by segmenting the region of interest according to the textual prior. The MMSummary system provides comprehensive summaries for fetal ultrasound examinations and, based on reported experiments, is estimated to reduce scanning time by approximately 31.5%, thereby suggesting the potential to enhance clinical workflow efficiency.
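As a hedged illustration, the three-stage pipeline described in the abstract can be sketched as plain Python control flow. The stage functions, the `Keyframe` record, and the keyframe-selection rule below are hypothetical stand-ins for the paper's learned components, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical frame record; the real system operates on ultrasound video frames.
@dataclass
class Keyframe:
    index: int
    caption: str = ""
    biometry: dict = field(default_factory=dict)

def detect_keyframes(video):
    """Stage 1: progressively select a concise, non-redundant set of keyframes.
    Stand-in rule: keep every 10th frame index."""
    return [Keyframe(i) for i in range(0, len(video), 10)]

def caption_keyframe(frame):
    """Stage 2: a large language model would generate the caption; stubbed here."""
    frame.caption = "fetal biometry" if frame.index % 20 == 0 else "anatomy view"
    return frame

def segment_and_measure(frame):
    """Stage 3: segment the region of interest using the caption as a textual
    prior, then estimate biometric parameters (stubbed)."""
    frame.biometry = {"head_circumference_mm": None}
    return frame

def mmsummary_pipeline(video):
    summary = []
    for frame in detect_keyframes(video):
        frame = caption_keyframe(frame)
        # Only keyframes captioned as fetal biometry proceed to measurement.
        if frame.caption == "fetal biometry":
            frame = segment_and_measure(frame)
        summary.append(frame)
    return summary

frames = mmsummary_pipeline(list(range(100)))  # 100 dummy frames
print(len(frames))  # 10 keyframes selected
```

The gate between stage 2 and stage 3 mirrors the abstract: measurement runs only for keyframes whose caption marks them as fetal biometry.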
Related papers
- Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos [22.437678884189697]
This study proposes a novel video classification method based on CNN and LSTM.
It reduces CNN-extracted image features to 1x512 dimension, followed by sorting and compressing feature vectors for LSTM training.
Experimental results demonstrate that our variable-frame CNNLSTM method outperforms other approaches across all metrics.
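One reading of the preprocessing described above, sketched in plain Python with hypothetical choices: the pooling rule, the norm-based sort order, and the fixed sequence length `MAX_FRAMES` are illustrative assumptions, not details from the paper.

```python
import math

FEATURE_DIM = 512      # target per-frame feature size named in the abstract
MAX_FRAMES = 16        # hypothetical fixed sequence length for the LSTM

def reduce_to_512(raw_feature):
    """Stand-in for the CNN head: average-pool an arbitrary-length vector
    down to a 1x512 feature (zero-pads if the input is shorter)."""
    if len(raw_feature) < FEATURE_DIM:
        return raw_feature + [0.0] * (FEATURE_DIM - len(raw_feature))
    bucket = len(raw_feature) / FEATURE_DIM
    out = []
    for i in range(FEATURE_DIM):
        lo, hi = int(i * bucket), int((i + 1) * bucket)
        out.append(sum(raw_feature[lo:hi]) / max(1, hi - lo))
    return out

def sort_and_compress(features):
    """Order frame features (here, by L2 norm) and truncate/zero-pad the
    variable-length sequence to a fixed number of frames for the LSTM."""
    ordered = sorted(features, key=lambda f: math.sqrt(sum(x * x for x in f)))
    ordered = ordered[:MAX_FRAMES]
    while len(ordered) < MAX_FRAMES:
        ordered.append([0.0] * FEATURE_DIM)
    return ordered

# Five dummy frames whose raw CNN features are 1024-dimensional.
video_features = [reduce_to_512([float(i)] * 1024) for i in range(5)]
sequence = sort_and_compress(video_features)
print(len(sequence), len(sequence[0]))  # 16 512
```

Padding to a fixed length is one common way to feed variable-frame videos to a recurrent model in fixed-size batches.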
arXiv Detail & Related papers (2025-02-17T06:35:37Z)
- REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning [0.7499722271664147]
A novel framework is proposed to perform real-time ego-motion tracking for endoscopes.
A multi-modal visual feature learning network is proposed to perform relative pose prediction.
The absolute pose of the endoscope is calculated based on relative poses.
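Accumulating relative poses into an absolute pose is a standard composition of homogeneous transforms; the minimal sketch below uses pure-Python 4x4 matrices and a translation-only example, and is not the paper's method.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices represented as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def identity4():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def accumulate_poses(relative_poses):
    """Chain per-step relative transforms T_i into absolute poses:
    T_abs(n) = T_1 @ T_2 @ ... @ T_n."""
    absolute, current = [], identity4()
    for rel in relative_poses:
        current = matmul4(current, rel)
        absolute.append(current)
    return absolute

def translation(dx, dy, dz):
    """Homogeneous transform for a pure translation."""
    t = identity4()
    t[0][3], t[1][3], t[2][3] = dx, dy, dz
    return t

# Three 1 mm forward steps along z accumulate to 3 mm of absolute translation.
poses = accumulate_poses([translation(0.0, 0.0, 1.0)] * 3)
print(poses[-1][2][3])  # 3.0
```

In practice the relative poses would also carry rotation, and drift accumulates with the chain length, which is why such trackers are often evaluated on absolute trajectory error.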
arXiv Detail & Related papers (2025-01-30T03:58:41Z)
- Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis [9.530028450239394]
MLVAS integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data.
It features an advanced strobing video extraction module that specifically identifies strobing frames from laryngeal videostroboscopy.
arXiv Detail & Related papers (2024-09-05T14:56:38Z)
- Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis [53.809054774037214]
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports.
It is the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations.
arXiv Detail & Related papers (2024-05-14T19:53:20Z)
- Breast Ultrasound Report Generation using LangChain [58.07183284468881]
We propose the integration of multiple image analysis tools, chained through LangChain with Large Language Models (LLMs), into the breast reporting process.
Our method can accurately extract relevant features from ultrasound images, interpret them in a clinical context, and produce comprehensive and standardized reports.
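The tool-chaining idea can be sketched in plain Python; note this is NOT the real LangChain API, and the tool names, stubbed findings, and report text below are all hypothetical.

```python
# Hypothetical image-analysis tools chained ahead of an LLM-style report writer.
def lesion_detector(image):
    """Stubbed detector: would run on the ultrasound frame."""
    return {"lesion_found": True, "size_mm": 7.2}

def birads_classifier(findings):
    """Stubbed classifier: assigns an illustrative BI-RADS category."""
    findings["birads"] = 4 if findings["lesion_found"] else 1
    return findings

def report_writer(findings):
    """Stand-in for the LLM: turns structured findings into a report line."""
    return (f"Lesion detected ({findings['size_mm']} mm); "
            f"BI-RADS category {findings['birads']}.")

def run_chain(image, tools):
    """Pass each tool's output to the next, LangChain-style."""
    result = image
    for tool in tools:
        result = tool(result)
    return result

report = run_chain("breast_us_frame.png",
                   [lesion_detector, birads_classifier, report_writer])
print(report)
```

The design point is that each tool emits structured findings, so the final language-model step only has to verbalize them rather than interpret raw pixels.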
arXiv Detail & Related papers (2023-12-05T00:28:26Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- Attentive Symmetric Autoencoder for Brain MRI Segmentation [56.02577247523737]
We propose a novel Attentive Symmetric Auto-encoder based on Vision Transformer (ViT) for 3D brain MRI segmentation tasks.
In the pre-training stage, the proposed auto-encoder pays more attention to reconstructing informative patches, selected according to gradient metrics.
Experimental results show that our proposed attentive symmetric auto-encoder outperforms the state-of-the-art self-supervised learning methods and medical image segmentation models.
arXiv Detail & Related papers (2022-09-19T09:43:19Z)
- Global Multi-modal 2D/3D Registration via Local Descriptors Learning [0.3299877799532224]
We present a novel approach to solve the problem of registration of an ultrasound sweep to a pre-operative image.
We learn dense keypoint descriptors from which we then estimate the registration.
Our approach is evaluated on a clinical dataset of paired MR volumes and ultrasound sequences.
arXiv Detail & Related papers (2022-05-06T18:24:19Z)
- Deep Learning for Ultrasound Beamforming [120.12255978513912]
Beamforming, the process of mapping received ultrasound echoes to the spatial image domain, lies at the heart of the ultrasound image formation chain.
Modern ultrasound imaging leans heavily on innovations in powerful digital receive channel processing.
Deep learning methods can play a compelling role in the digital beamforming pipeline.
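For context on what sits in that pipeline, here is a minimal delay-and-sum beamformer, the classical baseline that deep-learning beamformers augment or replace. Integer-sample delays and the two-channel echo example are simplifying assumptions; real systems compute geometric delays and interpolate fractional samples.

```python
def delay_and_sum(channel_data, delays):
    """Classic delay-and-sum beamforming: advance each receive channel by
    its arrival delay (in whole samples here), then sum across the aperture
    so echoes from the focal point add coherently."""
    n_samples = len(channel_data[0])
    beamformed = []
    for t in range(n_samples):
        acc = 0.0
        for ch, delay in zip(channel_data, delays):
            idx = t + delay  # compensate the later arrival on this channel
            if 0 <= idx < n_samples:
                acc += ch[idx]
        beamformed.append(acc)
    return beamformed

# Two channels carrying the same echo, offset by 3 samples; compensating the
# delay makes the echoes add coherently at t = 5.
ch0 = [0.0] * 16
ch0[5] = 1.0                        # echo arrives at sample 5
ch1 = [0.0] * 16
ch1[8] = 1.0                        # same echo arrives 3 samples later
out = delay_and_sum([ch0, ch1], delays=[0, 3])
print(out.index(max(out)), max(out))  # 5 2.0
```

Learned beamformers typically replace the fixed sum (or the per-channel weighting around it) with a network trained on channel data, while keeping the delay geometry.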
arXiv Detail & Related papers (2021-09-23T15:15:21Z)
- FetalNet: Multi-task deep learning framework for fetal ultrasound biometric measurements [11.364211664829567]
We propose an end-to-end multi-task neural network called FetalNet with an attention mechanism and stacked module for fetal ultrasound scan video analysis.
The main goal in fetal ultrasound video analysis is to find proper standard planes to measure the fetal head, abdomen and femur.
Our method called FetalNet outperforms existing state-of-the-art methods in both classification and segmentation in fetal ultrasound video recordings.
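As a worked illustration of the measurement step on such standard planes: clinical head circumference is commonly reported as the perimeter of an ellipse fitted to the segmented skull contour. The helper below uses Ramanujan's perimeter approximation and hypothetical semi-axes; it is not FetalNet's code.

```python
import math

def head_circumference_mm(a_mm, b_mm):
    """Head circumference from the semi-axes (a, b) of an ellipse fitted to
    the segmented skull contour, via Ramanujan's approximation:
    P ~ pi*(a+b)*(1 + 3h / (10 + sqrt(4 - 3h))), h = ((a-b)/(a+b))^2.
    Exact for a circle (a == b)."""
    h = ((a_mm - b_mm) ** 2) / ((a_mm + b_mm) ** 2)
    return math.pi * (a_mm + b_mm) * (1 + 3 * h / (10 + math.sqrt(4 - 3 * h)))

# Hypothetical semi-axes (mm) measured on a standard head plane.
hc = head_circumference_mm(60.0, 45.0)
print(f"HC = {hc:.1f} mm")
```

Abdominal circumference is computed the same way from its fitted ellipse, while femur length is a straight-line distance between the detected bone endpoints.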
arXiv Detail & Related papers (2021-07-14T19:13:33Z)
- Hybrid Attention for Automatic Segmentation of Whole Fetal Head in Prenatal Ultrasound Volumes [52.53375964591765]
We propose the first fully-automated solution to segment the whole fetal head in US volumes.
The segmentation task is firstly formulated as an end-to-end volumetric mapping under an encoder-decoder deep architecture.
We then combine the segmentor with a proposed hybrid attention scheme (HAS) to select discriminative features and suppress the non-informative volumetric features.
arXiv Detail & Related papers (2020-04-28T14:43:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.