Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech
- URL: http://arxiv.org/abs/2510.08593v1
- Date: Sun, 05 Oct 2025 09:32:12 GMT
- Title: Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech
- Authors: Yuxin Li, Eng Siong Chng, Cuntai Guan
- Abstract summary: Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
- Score: 51.14752758616364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
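The abstract's central premise, using all of an SSL encoder's layers rather than picking a single "best" one, can be illustrated with a minimal pure-Python sketch: a softmax-weighted sum over per-layer feature vectors. The layer values and scores below are toy assumptions for illustration only, not the authors' implementation (HAREN-CTC instead reorganizes layers via hierarchical clustering and fuses them with cross-attention).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of per-layer scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_features, layer_scores):
    """Softmax-weighted sum of per-layer feature vectors.

    layer_features: list of L vectors (one per SSL layer), each of dim D.
    layer_scores: list of L scalars (learnable in practice; fixed here).
    """
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, layer_features):
        for i, x in enumerate(feat):
            fused[i] += w * x
    return fused

# Three toy "layers" of a 2-dim feature; a higher score gives more weight.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_layers(layers, [0.0, 0.0, 0.0])
# With equal scores the fusion reduces to the plain mean of the layers.
```

A learned version would backpropagate through `layer_scores`, letting the model weight shallow (acoustic) and deep (semantic) layers differently per task.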
Related papers
- TFEC: Multivariate Time-Series Clustering via Temporal-Frequency Enhanced Contrastive Learning [22.300584288382012]
This paper proposes a Temporal-Frequency Enhanced Contrastive (TFEC) learning framework. To preserve temporal structure while generating low-distortion representations, a temporal-frequency Co-EnHancement (CoEH) mechanism is introduced. Experiments on six real-world benchmark datasets demonstrate TFEC's superiority, achieving 4.48% average NMI gains over SOTA methods.
arXiv Detail & Related papers (2026-01-12T13:59:27Z)
- Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks [0.0]
Isolated Sign Language Recognition is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. We propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes.
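The prototypical-network idea this summary describes, classifying a query by its proximity to class prototypes in a metric space, can be sketched in a few lines of plain Python. The 2-dim embeddings and class labels below are toy assumptions, not the paper's skeleton-encoder output.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prototypes(support):
    """Mean embedding per class from a few labeled support examples."""
    protos = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        protos[label] = [sum(e[i] for e in embeddings) / len(embeddings)
                         for i in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query to the class with the nearest prototype."""
    return min(protos, key=lambda label: euclidean(query, protos[label]))

# Toy few-shot episode: two classes, two support examples each.
support = {
    "hello": [[0.9, 0.1], [1.1, -0.1]],
    "thanks": [[-1.0, 1.0], [-0.8, 1.2]],
}
protos = prototypes(support)
pred = classify([1.0, 0.0], protos)  # lands near the "hello" prototype
```

Episodic training then optimizes the encoder so that queries cluster around their own class prototype; only the distance rule above is needed at inference time.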
arXiv Detail & Related papers (2025-12-11T11:50:03Z)
- UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection [37.37926854174864]
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. We propose a novel Unimodal-generated Multimodal Contrastive Learning framework for cross-compression-rate deepfake detection. Our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection.
arXiv Detail & Related papers (2025-11-24T10:56:22Z)
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning [80.18781219542016]
Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recent parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. We propose Cross-subspace Knowledge Alignment and Aggregation (CKAA) to enhance robustness against misleading task-ids.
arXiv Detail & Related papers (2025-07-13T03:11:35Z)
- A theoretical framework for self-supervised contrastive learning for continuous dependent data [79.62732169706054]
Self-supervised learning (SSL) has emerged as a powerful approach to learning representations, particularly in the field of computer vision. We propose a novel theoretical framework for contrastive SSL tailored to semantic independence between samples. Specifically, we outperform TS2Vec on the standard UEA and UCR benchmarks, with accuracy improvements of 4.17% and 2.08%, respectively.
arXiv Detail & Related papers (2025-06-11T14:23:47Z)
- Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation [21.345548821276097]
Class activation maps (CAMs) are commonly employed in weakly supervised semantic segmentation (WSSS) to produce pseudo-labels.
We propose an end-to-end WSSS model incorporating guided CAMs, wherein our segmentation model is trained while concurrently optimizing CAMs online.
CoSA is the first single-stage approach to outperform all existing multi-stage methods including those with additional supervision.
arXiv Detail & Related papers (2024-02-27T21:08:23Z)
- A Generically Contrastive Spatiotemporal Representation Enhancement for 3D Skeleton Action Recognition [10.403751563214113]
We propose a Contrastive Spatiotemporal Representation Enhancement (CSRE) framework to obtain more discriminative representations from the sequences. Specifically, we decompose the representation into spatial- and temporal-specific features to explore fine-grained motion patterns. To explicitly exploit the latent data distributions, we apply the attentive features to contrastive learning, which models the cross-sequence semantic relations.
arXiv Detail & Related papers (2023-12-23T02:54:41Z)
- Exploring Self-supervised Skeleton-based Action Recognition in Occluded Environments [40.322770236718775]
We propose IosPSTL, a simple and effective self-supervised learning framework designed to handle occlusions. IosPSTL combines a cluster-agnostic KNN imputer with an Occluded Partial Spatio-Temporal Learning (OPSTL) strategy. The OPSTL module incorporates Adaptive Spatial Masking (ASM) to make better use of intact, high-quality skeleton sequences during training.
arXiv Detail & Related papers (2023-09-21T12:51:11Z)
- Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Contrastive Prototype Learning with Augmented Embeddings for Few-Shot Learning [58.2091760793799]
We propose a novel contrastive prototype learning with augmented embeddings (CPLAE) model.
With a class prototype as an anchor, CPL aims to pull the query samples of the same class closer and those of different classes further away.
Extensive experiments on several benchmarks demonstrate that our proposed CPLAE achieves new state-of-the-art.
arXiv Detail & Related papers (2021-01-23T13:22:44Z)
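The pull/push behaviour the CPLAE summary describes, with a class prototype as anchor, can be sketched as a toy cross-entropy over negative squared distances to each prototype: minimizing it pulls a query toward its own class prototype and pushes it from the others. The loss form and embeddings are illustrative assumptions, not the paper's implementation.

```python
import math

def proto_contrastive_loss(query, protos, true_label, temperature=1.0):
    """Cross-entropy over negative squared distances to class prototypes.

    Lower distance to the true prototype (relative to the others)
    means lower loss; embeddings here are plain Python lists.
    """
    def neg_dist(p):
        return -sum((q - x) ** 2 for q, x in zip(query, p)) / temperature

    logits = {label: neg_dist(p) for label, p in protos.items()}
    m = max(logits.values())  # subtract max for numerical stability
    total = sum(math.exp(v - m) for v in logits.values())
    log_prob = (logits[true_label] - m) - math.log(total)
    return -log_prob

protos = {"a": [1.0, 0.0], "b": [-1.0, 0.0]}
near = proto_contrastive_loss([0.9, 0.0], protos, "a")   # query near its prototype
far = proto_contrastive_loss([-0.9, 0.0], protos, "a")   # query near the wrong one
# The loss is much lower when the query lies near its own class prototype.
```

In training, gradients of this loss move the encoder so same-class queries cluster around their prototype while other classes are repelled.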
This list is automatically generated from the titles and abstracts of the papers in this site.