Related papers: Temporal-Aware Iterative Speech Model for Dementia Detection

Temporal-Aware Iterative Speech Model for Dementia Detection

URL: http://arxiv.org/abs/2510.00030v1
Date: Fri, 26 Sep 2025 01:56:07 GMT
Title: Temporal-Aware Iterative Speech Model for Dementia Detection
Authors: Chukwuemeka Ugwu, Oluwafemi Oyeleke,
Abstract summary: Current methods for automated dementia detection using speech rely on static, time-agnostic features or aggregated linguistic content.<n>We introduce TAI-Speech, a Temporal Aware Iterative framework that dynamically models spontaneous speech for dementia detection.<n>Our work provides a more flexible and robust solution for automated cognitive assessment, operating directly on the dynamics of raw audio.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep learning systems often struggle with processing long sequences, where computational complexity can become a bottleneck. Current methods for automated dementia detection using speech frequently rely on static, time-agnostic features or aggregated linguistic content, lacking the flexibility to model the subtle, progressive deterioration inherent in speech production. These approaches often miss the dynamic temporal patterns that are critical early indicators of cognitive decline. In this paper, we introduce TAI-Speech, a Temporal Aware Iterative framework that dynamically models spontaneous speech for dementia detection. The flexibility of our method is demonstrated through two key innovations: 1) Optical Flow-inspired Iterative Refinement: By treating spectrograms as sequential frames, this component uses a convolutional GRU to capture the fine-grained, frame-to-frame evolution of acoustic features. 2) Cross-Attention Based Prosodic Alignment: This component dynamically aligns spectral features with prosodic patterns, such as pitch and pauses, to create a richer representation of speech production deficits linked to functional decline (IADL). TAI-Speech adaptively models the temporal evolution of each utterance, enhancing the detection of cognitive markers. Experimental results on the DementiaBank dataset show that TAI-Speech achieves a strong AUC of 0.839 and 80.6\% accuracy, outperforming text-based baselines without relying on ASR. Our work provides a more flexible and robust solution for automated cognitive assessment, operating directly on the dynamics of raw audio.

Related papers

TEn-CATG:Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph [28.536724593429398]
TEn-CATG is a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning.<n>We show that TEn-CATG achieves robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
arXiv Detail & Related papers (2025-09-04T10:32:40Z)
Dynamic Fusion Multimodal Network for SpeechWellness Detection [7.169178956727836]
Suicide is one of the leading causes of death among adolescents.<n>Previous suicide risk prediction studies have primarily focused on either textual or acoustic information in isolation.<n>We explore a lightweight multi-branch multimodal system based on a dynamic fusion mechanism for speech detection.
arXiv Detail & Related papers (2025-08-25T14:18:12Z)
Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder [8.239710313549466]
This study integrates pause features with semantic coherence metrics across three datasets.<n>Key findings demonstrate that pause features alone robustly predict the severity of formal thought disorder (FTD)<n>These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech.
arXiv Detail & Related papers (2025-07-17T22:00:16Z)
StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning [79.44594332189018]
Class-Incremental Learning (CIL) seeks to develop models that continuously learn new action categories over time without previously acquired knowledge.<n>Existing approaches either rely on forgetting, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling.<n>We propose a unified and exemplar-free VCIL framework that explicitly disentangles and preserves information.
arXiv Detail & Related papers (2025-05-20T06:46:51Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.<n>Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics [0.7083699704958353]
We propose a lightweight squeeze-excitation deep learning-based multi stream spatial temporal dynamics time-varying feature extraction approach.<n>The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively.
arXiv Detail & Related papers (2025-04-04T07:11:12Z)
Dementia Insights: A Context-Based MultiModal Approach [0.3749861135832073]
Early detection is crucial for timely interventions that may slow disease progression.<n>Large pre-trained models (LPMs) for text and audio have shown promise in identifying cognitive impairments.<n>This study proposes a context-based multimodal method, integrating both text and audio data using the best-performing LPMs.
arXiv Detail & Related papers (2025-03-03T06:46:26Z)
REST: Efficient and Accelerated EEG Seizure Analysis through Residual State Updates [54.96885726053036]
This paper introduces a novel graph-based residual state update mechanism (REST) for real-time EEG signal analysis. By leveraging a combination of graph neural networks and recurrent structures, REST efficiently captures both non-Euclidean geometry and temporal dependencies within EEG data. Our model demonstrates high accuracy in both seizure detection and classification tasks.
arXiv Detail & Related papers (2024-06-03T16:30:19Z)
Detecting Anomalies in Dynamic Graphs via Memory enhanced Normality [39.476378833827184]
Anomaly detection in dynamic graphs presents a significant challenge due to the temporal evolution of graph structures and attributes. We introduce a novel spatial- temporal memories-enhanced graph autoencoder (STRIPE) STRIPE significantly outperforms existing methods with 5.8% improvement in AUC scores and 4.62X faster in training time.
arXiv Detail & Related papers (2024-03-14T02:26:10Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned. We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
A Novel Speech Intelligibility Enhancement Model based on CanonicalCorrelation and Deep Learning [12.913738983870621]
We present a canonical correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model. We show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z)
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals. Two main challenges are the complex acoustic environment and the real-time processing requirement. We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.