Related papers: TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

URL: http://arxiv.org/abs/2510.14922v1
Date: Thu, 16 Oct 2025 17:39:59 GMT
Title: TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG
Authors: Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni,
Abstract summary: Depression is a widespread mental health disorder, yet its automatic detection remains challenging.<n>Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals.<n>We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text.
Score: 3.599572587929144
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.

Related papers

Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study [18.4135590766724]
Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities.<n>We propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan.
arXiv Detail & Related papers (2025-12-24T05:07:07Z)
MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA)<n>We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context.<n>We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
Exploring Machine Learning and Language Models for Multimodal Depression Detection [8.357574678947245]
This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge.<n>We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features.<n>Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities.
arXiv Detail & Related papers (2025-08-28T14:07:07Z)
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content.<n>Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis [26.867610944625337]
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities.<n>Past research predominantly focused on improving representation learning techniques and feature fusion strategies.<n>We introduce a Text-oriented Cross-Attention Network (TCAN) emphasizing the predominant role of the text modality in MSA.
arXiv Detail & Related papers (2024-04-06T07:56:09Z)
Contrastive Learning on Multimodal Analysis of Electronic Health Records [15.392566551086782]
We propose a novel feature embedding generative model and design a multimodal contrastive loss to obtain the multimodal EHR feature representation.<n>Our theoretical analysis demonstrates the effectiveness of multimodal learning compared to single-modality learning.<n>This connection paves the way for a privacy-preserving algorithm tailored for multimodal EHR feature representation learning.
arXiv Detail & Related papers (2024-03-22T03:01:42Z)
Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
A Novel Energy based Model Mechanism for Multi-modal Aspect-Based Sentiment Analysis [85.77557381023617]
We propose a novel framework called DQPSA for multi-modal sentiment analysis. PDQ module uses the prompt as both a visual query and a language query to extract prompt-aware visual information. EPE module models the boundaries pairing of the analysis target from the perspective of an Energy-based Model.
arXiv Detail & Related papers (2023-12-13T12:00:46Z)
Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task. We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z)
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model. HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
MGTBench: Benchmarking Machine-Generated Text Detection [54.81446366272403]
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs) We show that a larger number of words in general leads to better performance and most detection methods can achieve similar performance with much fewer training samples. Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z)
Self-Supervised Multimodal Domino: in Search of Biomarkers for Alzheimer's Disease [19.86082635340699]
We propose a taxonomy of all reasonable ways to organize self-supervised representation-learning algorithms. We first evaluate models on toy multimodal MNIST datasets and then apply them to a multimodal neuroimaging dataset with Alzheimer's disease patients. Results show that the proposed approach outperforms previous self-supervised encoder-decoder methods.
arXiv Detail & Related papers (2020-12-25T20:28:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.