Breathing and Semantic Pause Detection and Exertion-Level Classification in Post-Exercise Speech
- URL: http://arxiv.org/abs/2509.15473v1
- Date: Thu, 18 Sep 2025 22:39:34 GMT
- Title: Breathing and Semantic Pause Detection and Exertion-Level Classification in Post-Exercise Speech
- Authors: Yuyu Wang, Wuyue Xia, Huaxiu Yao, Jingping Nie
- Abstract summary: Post-exercise speech contains rich physiological and linguistic cues, often marked by semantic pauses, breathing pauses, and combined breathing-semantic pauses. We provide systematic annotations of pause types and conduct exploratory breathing and semantic pause detection and exertion-level classification across deep learning models. Results show per-type detection accuracy up to 89$\%$ for semantic, 55$\%$ for breathing, 86$\%$ for combined pauses, and 73$\%$ overall, while exertion-level classification achieves 90.5$\%$ accuracy, outperforming prior work.
- Score: 33.39650261642241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-exercise speech contains rich physiological and linguistic cues, often marked by semantic pauses, breathing pauses, and combined breathing-semantic pauses. Detecting these events enables assessment of recovery rate, lung function, and exertion-related abnormalities. However, existing works on identifying and distinguishing different types of pauses in this context are limited. In this work, building on a recently released dataset with synchronized audio and respiration signals, we provide systematic annotations of pause types. Using these annotations, we systematically conduct exploratory breathing and semantic pause detection and exertion-level classification across deep learning models (GRU, 1D CNN-LSTM, AlexNet, VGG16), acoustic features (MFCC, MFB), and layer-stratified Wav2Vec2 representations. We evaluate three setups (single feature, feature fusion, and a two-stage detection-classification cascade) under both classification and regression formulations. Results show per-type detection accuracy up to 89$\%$ for semantic, 55$\%$ for breathing, 86$\%$ for combined pauses, and 73$\%$ overall, while exertion-level classification achieves 90.5$\%$ accuracy, outperforming prior work.
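The two-stage detection-classification cascade and the per-type versus overall accuracy figures mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the segment features (`energy`, `breath`), the thresholds, the stand-in functions `is_pause` and `pause_type`, the extra `speech` label, and the reading of "per-type accuracy" as the fraction of correctly labelled segments within each true class.

```python
from collections import defaultdict

def cascade_predict(segments, is_pause, pause_type):
    """Stage 1 decides pause vs. speech; stage 2 labels only the detected pauses."""
    return [pause_type(s) if is_pause(s) else "speech" for s in segments]

def per_type_accuracy(y_true, y_pred):
    """Fraction of correct predictions within each true label, plus the overall fraction."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    scores = {label: hits[label] / totals[label] for label in totals}
    scores["overall"] = sum(hits.values()) / len(y_true)
    return scores

# Hypothetical stand-ins for trained models (thresholds are arbitrary).
def is_pause(segment):
    return segment["energy"] < 0.2

def pause_type(segment):
    if segment["breath"] > 0.7:
        return "breathing"
    if segment["breath"] > 0.3:
        return "combined"
    return "semantic"

# Toy segments with made-up feature values, not real data.
segments = [
    {"energy": 0.05, "breath": 0.1},
    {"energy": 0.08, "breath": 0.9},
    {"energy": 0.90, "breath": 0.0},
    {"energy": 0.10, "breath": 0.5},
    {"energy": 0.30, "breath": 0.8},  # a breathing pause that stage 1 misses
]
y_true = ["semantic", "breathing", "speech", "combined", "breathing"]

y_pred = cascade_predict(segments, is_pause, pause_type)
scores = per_type_accuracy(y_true, y_pred)
print(scores)
```

In this toy run the missed detection in stage 1 lowers only the breathing score, which shows how one pause type can lag well behind the others while the overall figure stays comparatively high.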
Related papers
- WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying.
We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types.
Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z)
- Temporal-Aware Iterative Speech Model for Dementia Detection [0.0]
Current methods for automated dementia detection using speech rely on static, time-agnostic features or aggregated linguistic content.
We introduce TAI-Speech, a Temporal Aware Iterative framework that dynamically models spontaneous speech for dementia detection.
Our work provides a more flexible and robust solution for automated cognitive assessment, operating directly on the dynamics of raw audio.
arXiv Detail & Related papers (2025-09-26T01:56:07Z)
- EZhouNet: A framework based on graph neural network and anchor interval for the respiratory sound event detection [7.29257171556766]
We propose a graph neural network-based framework with anchor intervals, capable of handling variable-length audio.
Our method improves both the flexibility and applicability of respiratory sound detection.
arXiv Detail & Related papers (2025-09-01T06:10:30Z)
- Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder [8.239710313549466]
This study integrates pause features with semantic coherence metrics across three datasets.
Key findings demonstrate that pause features alone robustly predict the severity of formal thought disorder (FTD).
These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech.
arXiv Detail & Related papers (2025-07-17T22:00:16Z)
- Infusing Acoustic Pause Context into Text-Based Dementia Assessment [7.8642589679025034]
This work investigates the use of pause-enriched transcripts in language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment.
The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts.
arXiv Detail & Related papers (2024-08-27T16:44:41Z)
- Seq2seq for Automatic Paraphasia Detection in Aphasic Speech [14.686874756530322]
Paraphasias are speech errors that are characteristic of aphasia and represent an important signal in assessing disease severity and subtype.
Traditionally, clinicians manually identify paraphasias by transcribing and analyzing speech-language samples.
We propose a novel, sequence-to-sequence (seq2seq) model that is trained end-to-end (E2E) to perform both ASR and paraphasia detection tasks.
arXiv Detail & Related papers (2023-12-16T18:22:37Z)
- Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection [69.53626024091076]
Alzheimer's disease (AD) is particularly prominent in older adults.
Recent advances in pre-trained models motivate AD detection modeling to shift from low-level features to high-level representations.
This paper presents several efficient methods to extract better AD-related cues from high-level acoustic and linguistic features.
arXiv Detail & Related papers (2023-03-14T16:03:28Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
Mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- Self-supervised Pretraining with Classification Labels for Temporal Activity Detection [54.366236719520565]
Temporal Activity Detection aims to predict activity classes per frame.
Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited.
This work proposes a novel self-supervised pretraining method for detection leveraging classification labels.
arXiv Detail & Related papers (2021-11-26T18:59:28Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- $M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild [86.40973759048957]
This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge.
In the proposed $M^3$T framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate the valence and arousal.
We evaluated the $M^3$T framework on the validation set provided by ABAW and it significantly outperforms the baseline method.
arXiv Detail & Related papers (2020-02-07T18:53:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.