Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Cord Paralysis
- URL: http://arxiv.org/abs/2409.03597v1
- Date: Thu, 5 Sep 2024 14:56:38 GMT
- Title: Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Cord Paralysis
- Authors: Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Faya Liang, Ming Li
- Abstract summary: The Multimodal Analyzing System for Laryngoscope (MASL) combines audio and video data to automatically extract key segments and metrics from laryngeal videostroboscopic videos for clinical assessment.
MASL integrates glottis detection with keyword spotting to analyze patient vocalizations and refine video highlights for better inspection of vocal cord movements.
- Score: 7.583632364503357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the Multimodal Analyzing System for Laryngoscope (MASL), a system that combines audio and video data to automatically extract key segments and metrics from laryngeal videostroboscopic videos for clinical assessment. MASL integrates glottis detection with keyword spotting to analyze patient vocalizations and refine video highlights for better inspection of vocal cord movements. The system includes a strobing video extraction module that identifies frames by analyzing hue, saturation, and value fluctuations. MASL also provides effective metrics for vocal cord paralysis detection, employing a two-stage glottis segmentation process using U-Net followed by diffusion-based refinement to reduce false positives. Instead of glottal area waveforms, MASL estimates anterior glottic angle waveforms (AGAW) from glottis masks, evaluating both left and right vocal cords to detect unilateral vocal fold paralysis (UVFP). By comparing AGAW variances, MASL distinguishes between left and right paralysis. Ablation studies and experiments on public and real-world datasets validate MASL's segmentation module and demonstrate its ability to provide reliable metrics for UVFP diagnosis.
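To make the AGAW idea concrete, the sketch below estimates per-frame left and right anterior glottic angles from binary glottis masks and lateralizes the paralysis by comparing variances. The anterior-point heuristic, the vertical-midline assumption, the helper names, and the variance-ratio threshold are illustrative assumptions, not MASL's published implementation.

```python
# Sketch: estimating an anterior glottic angle waveform (AGAW) from per-frame
# binary glottis masks and lateralizing UVFP by comparing left/right variances.
import numpy as np

def anterior_glottic_angles(mask):
    """Left/right angles (radians) between an assumed vertical glottal midline
    and the line from the anterior point to each cord's edge, for one mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return np.nan, np.nan
    anterior = np.array([ys.min(), xs[ys.argmin()]], dtype=float)  # topmost pixel
    axis = np.array([1.0, 0.0])  # "down" in (row, col) coordinates
    left, right = [], []
    for y in np.unique(ys):
        row = xs[ys == y]
        for x, bucket in ((row.min(), left), (row.max(), right)):
            v = np.array([y, x], dtype=float) - anterior
            n = np.linalg.norm(v)
            if n > 1e-6:
                bucket.append(np.arccos(np.clip(v @ axis / n, -1.0, 1.0)))
    return max(left, default=np.nan), max(right, default=np.nan)

def lateralize_uvfp(masks, ratio=3.0):
    """The paralyzed cord barely moves, so its AGAW variance is much lower."""
    agaw = np.array([anterior_glottic_angles(m) for m in masks])  # shape (T, 2)
    var_l, var_r = np.nanvar(agaw, axis=0)
    if var_r > ratio * var_l:
        return "suspected left-side paralysis"
    if var_l > ratio * var_r:
        return "suspected right-side paralysis"
    return "no clear lateralization"
```

The logic follows the abstract: evaluating the two cords separately yields one AGAW per side, and the side whose angle barely varies over time is the immobile one, which is what separates left from right UVFP.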
Related papers
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- Multi-View Spectrogram Transformer for Respiratory Sound Classification [32.346046623638394]
A Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer.
Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
arXiv Detail & Related papers (2023-11-16T08:17:02Z)
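A minimal sketch of the multi-view idea in the MVST entry above: compute spectrograms of the same recording at several time-frequency resolutions and stack them as input channels for a vision transformer. The window sizes, mel resolution, and the chirp stand-in signal are assumptions, not the paper's settings.

```python
# Sketch: multiple time-frequency "views" of one signal, stacked as channels.
import numpy as np
import librosa

def multi_view_spectrogram(y, sr, n_ffts=(256, 512, 1024), n_mels=64, frames=128):
    views = []
    for n_fft in n_ffts:
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=n_fft // 4, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        # Trim/pad every view to a common time length so they stack cleanly.
        mel_db = librosa.util.fix_length(mel_db, size=frames, axis=1)
        views.append(mel_db)
    return np.stack(views)  # (n_views, n_mels, frames): per-view ViT channels

y = librosa.chirp(fmin=200, fmax=4000, sr=16000, duration=2.0)  # stand-in audio
print(multi_view_spectrogram(y, 16000).shape)  # (3, 64, 128)
```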
- A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography [3.7775754350457746]
We introduce a novel unified framework using a convolutional neural network for comprehensive analysis of spectral and tissue Doppler echocardiography images.
The network automatically recognizes key features across various Doppler views, with novel Doppler shape embedding and anti-aliasing modules.
Empirical results indicate consistent gains in performance metrics, including the Dice similarity coefficient (DSC) and intersection over union (IoU).
arXiv Detail & Related papers (2023-11-14T15:10:05Z)
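The entry above mentions an anti-aliasing module without detail. A common way such modules are built is blur-before-stride downsampling (BlurPool-style); the sketch below shows that generic pattern and should not be read as the paper's exact design.

```python
# Sketch: anti-aliased downsampling by blurring before striding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Blur-then-downsample, so striding does not alias high frequencies."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)
        self.register_buffer("kernel", (k / k.sum()).expand(channels, 1, 3, 3).clone())
        self.stride = stride
        self.channels = channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")  # keep borders sensible
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

feat = torch.randn(1, 16, 64, 64)   # e.g. a feature map from a Doppler image
print(BlurPool2d(16)(feat).shape)   # torch.Size([1, 16, 32, 32])
```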
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
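The entry above does not spell out the self-supervised framework. One standard recipe for aligning audio with paired video frames is a symmetric InfoNCE objective over the two embeddings, sketched here purely as an assumed illustration.

```python
# Sketch: symmetric InfoNCE alignment of paired audio and image embeddings.
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb, image_emb, temperature=0.07):
    """audio_emb, image_emb: (B, D) embeddings of paired clips/frames."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = true pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = audio_visual_infonce(torch.randn(8, 128), torch.randn(8, 128))
```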
- Analysis and Detection of Pathological Voice using Glottal Source Features [18.80191660913831]
Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method.
We derive mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and zero-frequency filtering (ZFF).
Analysis of features revealed that the glottal source contains information that discriminates normal and pathological voice.
arXiv Detail & Related papers (2023-09-25T12:14:25Z)
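The QCP and ZFF estimation steps in the entry above are specialized, but the downstream feature derivation is straightforward: given a glottal source waveform, MFCCs can be computed directly. The frame sizes and the noise stand-in below are assumptions made only so the snippet runs.

```python
# Sketch: MFCCs from an (already estimated) glottal source waveform.
import numpy as np
import librosa

sr = 16000
glottal_flow = np.random.randn(sr)  # placeholder for a QCP/ZFF glottal estimate
mfcc = librosa.feature.mfcc(y=glottal_flow, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms / 10 ms frames
print(mfcc.shape)  # (13, n_frames)
```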
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 (detecting and grounding multi-modal manipulation) problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
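A minimal sketch of the wavelet decomposition the UFAFormer entry above describes: a single-level 2D DWT splits an image into one low-frequency approximation and three high-frequency detail bands. The pywt library and the Haar wavelet are my choices here, not details taken from the paper.

```python
# Sketch: one-level 2D discrete wavelet transform into frequency sub-bands.
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)  # stand-in grayscale face image
ca, (ch, cv, cd) = pywt.dwt2(img, "haar")
# ca: low-frequency approximation; ch/cv/cd: horizontal/vertical/diagonal
# detail bands, where high-frequency forgery artifacts tend to concentrate.
print(ca.shape, ch.shape)  # (128, 128) (128, 128)
```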
- GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis [14.737295160286939]
Vision-based machine learning (ML) methods have gained popularity as secondary layers of verification.
We propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that provides explainability.
We show the flexibility of our framework on two critical tasks: ejection fraction (EF) estimation and aortic stenosis (AS) severity detection.
arXiv Detail & Related papers (2023-08-25T07:30:18Z)
- DopUS-Net: Quality-Aware Robotic Ultrasound Imaging based on Doppler Signal [48.97719097435527]
DopUS-Net combines Doppler images with B-mode images to increase the segmentation accuracy and robustness for small blood vessels.
An artery re-identification module qualitatively evaluates the real-time segmentation results and automatically optimizes the probe pose for enhanced Doppler images.
arXiv Detail & Related papers (2023-05-15T18:19:29Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
Experiments achieved a correlation of 0.675 with ground-truth tract variables.
arXiv Detail & Related papers (2022-03-11T07:27:42Z)
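A minimal sketch of the setup in the entry above: a feed-forward network regresses six tract variables from acoustic features and is scored by per-variable Pearson correlation, the metric behind the reported 0.675. Layer sizes, the feature dimensionality, and the synthetic data are assumptions.

```python
# Sketch: feed-forward regression of six tract variables, scored by correlation.
import numpy as np
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(120, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 6))          # six tract variables

x = torch.randn(1000, 120)   # stand-in spectro-temporal features
y = torch.randn(1000, 6)     # stand-in ground-truth trajectories
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    pred = model(x).numpy()
corr = [np.corrcoef(pred[:, i], y[:, i].numpy())[0, 1] for i in range(6)]
print(np.mean(corr))  # cf. the ~0.675 correlation reported on real data
```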
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
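A minimal sketch of the entry above: convert pairwise rank correlations between classifiers' detection scores into distances, then embed them in 2D for plotting. Kendall's tau and metric MDS are plausible choices consistent with the summary, not necessarily the paper's exact ones.

```python
# Sketch: 2D "classifier adjacency" map from rank correlations of scores.
import numpy as np
from scipy.stats import kendalltau
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 200))      # 5 classifiers x 200 trial scores

n = scores.shape[0]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        tau, _ = kendalltau(scores[i], scores[j])
        dist[i, j] = dist[j, i] = 1.0 - tau   # high correlation -> nearby

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords.shape)  # (5, 2): one 2D point per classifier, ready to plot
```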
This list is automatically generated from the titles and abstracts of the papers on this site.