CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning
- URL: http://arxiv.org/abs/2508.03764v1
- Date: Mon, 04 Aug 2025 23:09:07 GMT
- Title: CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning
- Authors: Justin Luong, Hao Xue, Flora D. Salim
- Abstract summary: CoughViT is a novel pre-training framework for learning general-purpose cough sound representations. We employ masked data modelling to train a feature encoder in a self-supervised manner. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations.
- Score: 8.789624590579903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient's airways. In recent years, AI-based diagnostic systems operating on respiratory sounds have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis, which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised manner. We evaluate our approach against other pre-training strategies on three diagnostically important cough classification tasks. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations in enhancing performance on downstream tasks.
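The masked data modelling objective described in the abstract (hide patches of the input spectrogram and train the encoder to reconstruct them) can be sketched as follows. This is a minimal NumPy illustration of the patch-masking step only, not the authors' implementation; the 16x16 patch size and 75% mask ratio are assumptions borrowed from common masked-autoencoder setups.

```python
import numpy as np

def mask_patches(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Split a spectrogram into non-overlapping patches and mask a random
    subset, as in masked-data-modelling pre-training.

    Returns all flattened patches, a boolean mask (True = masked), and the
    visible patches that would be fed to the encoder; the decoder would be
    trained to reconstruct the masked ones.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = spec.shape
    pf, pt = patch
    assert n_freq % pf == 0 and n_time % pt == 0, "spectrogram must tile evenly"
    # Reshape into a sequence of flattened (pf * pt) patches.
    patches = (spec.reshape(n_freq // pf, pf, n_time // pt, pt)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, pf * pt))
    n = patches.shape[0]
    n_mask = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    visible = patches[~mask]  # the encoder sees only the unmasked patches
    return patches, mask, visible
```

With a 128x64 spectrogram and 16x16 patches this yields 32 patches, of which 8 remain visible at a 75% mask ratio.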
Related papers
- GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps. We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
arXiv Detail & Related papers (2025-06-22T08:09:58Z) - The Efficacy of Semantics-Preserving Transformations in Self-Supervised Learning for Medical Ultrasound [60.80780313225093]
This study systematically investigated the impact of data augmentation and preprocessing strategies in self-supervised learning for lung ultrasound. Three data augmentation pipelines were assessed: a baseline pipeline commonly used across imaging domains, a novel semantic-preserving pipeline designed for ultrasound, and a distilled set of the most effective transformations from both pipelines.
arXiv Detail & Related papers (2025-04-10T16:26:47Z) - Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment [65.70317151363204]
This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings. In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition. Our framework integrates voice activity detection, speaker diarization, and automated speech recognition, with a novel enhancement that removes hallucinations.
arXiv Detail & Related papers (2024-12-01T10:35:12Z) - Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers [14.144599890583308]
We propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set.
Experimental results demonstrate that our proposed approach consistently outperforms prior art on two benchmark datasets for COVID-19 diagnosis and on a proprietary dataset for COPD/non-COPD classification, with an AUROC of 92.5%.
arXiv Detail & Related papers (2024-08-28T09:40:40Z) - Automatic Detection of COVID-19 from Chest X-ray Images Using Deep Learning Model [3.8329708057847305]
The coronavirus (2019-nCoV) has been spreading widely since last year and has shaken the entire world.
Due to limited test kits, it is also a daunting task to test every patient with severe respiratory problems using conventional techniques.
We propose deep learning models to demonstrate the effectiveness of such diagnostic systems.
arXiv Detail & Related papers (2024-08-27T10:01:58Z) - Real-Time Magnetic Tracking and Diagnosis of COVID-19 via Machine Learning [2.737411991771932]
The COVID-19 pandemic underscored the importance of reliable, noninvasive diagnostic tools for robust public health interventions.
In this work, we fused magnetic respiratory sensing technology (MRST) with machine learning (ML) to create a diagnostic platform for real-time tracking and diagnosis of COVID-19 and other respiratory diseases.
arXiv Detail & Related papers (2023-11-01T13:57:33Z) - Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z) - COVID-19 Detection System: A Comparative Analysis of System Performance Based on Acoustic Features of Cough Audio Signals [0.6963971634605796]
This research aims to explore various acoustic features that enhance the performance of machine learning (ML) models in detecting COVID-19 from cough signals.
It investigates the efficacy of three feature extraction techniques, including Mel Frequency Cepstral Coefficients (MFCC), Chroma, and Spectral Contrast features, when applied to two machine learning algorithms, Support Vector Machine (SVM) and Multilayer Perceptron (MLP).
The proposed system provides a practical solution and demonstrates state-of-the-art classification performance, with an AUC of 0.843 on the COUGHVID dataset and 0.953 on the Virufy dataset.
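For context, the MFCC features that study compares can be computed from scratch along these lines: frame the signal, apply a window, take the power spectrum, pass it through a triangular mel filterbank, take logs, and keep the first DCT-II coefficients. This is a rough NumPy-only sketch with typical (assumed) parameter choices, not the configuration used in the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal from-scratch MFCC extraction; returns (n_frames, n_mfcc)."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Build a triangular mel filterbank.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(power @ fb.T + 1e-10)
    # DCT-II over the mel axis; keep the first n_mfcc coefficients.
    k = np.arange(n_mels)
    basis = np.cos(np.pi * (k[None, :] + 0.5) * np.arange(n_mfcc)[:, None] / n_mels)
    return log_mel @ basis.T
```

The resulting per-frame coefficient vectors are what would be fed (e.g. averaged or stacked) into a classifier such as an SVM or MLP.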
arXiv Detail & Related papers (2023-09-08T08:33:24Z) - A Survey of the Impact of Self-Supervised Pretraining for Diagnostic Tasks with Radiological Images [71.26717896083433]
Self-supervised pretraining has been observed to be effective at improving feature representations for transfer learning.
This review summarizes recent research into its usage in X-ray, computed tomography, magnetic resonance, and ultrasound imaging.
arXiv Detail & Related papers (2023-09-05T19:45:09Z) - Towards the Identifiability and Explainability for Personalized Learner Modeling: An Inductive Paradigm [36.60917255464867]
We propose an identifiable cognitive diagnosis framework (ID-CDF) based on a novel response-proficiency-response paradigm inspired by encoder-decoder models.
We show that ID-CDF can effectively address these problems without loss of diagnostic precision.
arXiv Detail & Related papers (2023-09-01T07:18:02Z) - RECAP-KG: Mining Knowledge Graphs from Raw GP Notes for Remote COVID-19 Assessment in Primary Care [45.43645878061283]
We present a framework that performs knowledge graph construction from raw GP medical notes written during or after patient consultations.
Our knowledge graphs include information about existing patient symptoms, their duration, and their severity.
We apply our framework to consultation notes of COVID-19 patients in the UK.
arXiv Detail & Related papers (2023-06-17T23:35:51Z) - A Machine Learning Approach for Delineating Similar Sound Symptoms of Respiratory Conditions on a Smartphone [0.0]
We leverage the improved computational and storage capabilities of modern smartphones to distinguish the respiratory sound symptoms using machine learning algorithms.
The appreciable performance of these algorithms on a mobile phone shows that the smartphone can serve as an alternative tool for recognising and discriminating respiratory symptoms in real-time scenarios.
arXiv Detail & Related papers (2021-10-15T07:24:30Z) - BiteNet: Bidirectional Temporal Encoder Network to Predict Medical Outcomes [53.163089893876645]
We propose a novel self-attention mechanism that captures the contextual dependency and temporal relationships within a patient's healthcare journey.
An end-to-end bidirectional temporal encoder network (BiteNet) then learns representations of the patient's journeys.
We have evaluated the effectiveness of our methods on two supervised prediction and two unsupervised clustering tasks with a real-world EHR dataset.
arXiv Detail & Related papers (2020-09-24T00:42:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.