Exploring Machine Learning and Language Models for Multimodal Depression Detection
- URL: http://arxiv.org/abs/2508.20805v1
- Date: Thu, 28 Aug 2025 14:07:07 GMT
- Title: Exploring Machine Learning and Language Models for Multimodal Depression Detection
- Authors: Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng, Xiaoxiao Miao
- Abstract summary: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities.
- Score: 8.357574678947245
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.
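As a concrete illustration (not the authors' released code), the sketch below shows the kind of early-fusion XGBoost baseline the abstract describes: per-modality feature vectors are concatenated and passed to a gradient-boosted classifier. The feature dimensions, hyperparameters, and synthetic data are illustrative assumptions.

```python
# Minimal sketch of an early-fusion XGBoost baseline for multimodal
# depression detection. All sizes below are illustrative assumptions.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200                              # hypothetical number of sessions
audio = rng.normal(size=(n, 88))     # e.g. eGeMAPS-style acoustic features
video = rng.normal(size=(n, 49))     # e.g. facial action-unit statistics
text = rng.normal(size=(n, 384))     # e.g. sentence embeddings of transcripts
y = rng.integers(0, 2, size=n)       # binary depression label

X = np.concatenate([audio, video, text], axis=1)  # early fusion
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```

The transformer and LLM variants the abstract compares would replace the fixed concatenated features with learned encoders, trading the interpretability of tree models for representational power.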
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes the scaling asymmetry between modalities by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
- It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models [23.966623683606425]
Depression is one of the most prevalent mental health disorders globally. We propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level.
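As a rough illustration of the timestamp-level alignment this summary mentions, the sketch below interpolates frame-rate visual features onto the audio feature timeline before fusion; the sampling rates and dimensions are assumptions, not the paper's values.

```python
# Hedged sketch: resample per-frame visual features onto audio timestamps
# so the two streams can be fused position-by-position.
import numpy as np

def align_visual_to_audio(visual, v_hz, n_audio, a_hz):
    """Interpolate each visual feature dimension at the audio timestamps."""
    t_audio = np.arange(n_audio) / a_hz
    t_video = np.arange(len(visual)) / v_hz
    return np.stack([np.interp(t_audio, t_video, visual[:, d])
                     for d in range(visual.shape[1])], axis=1)

audio = np.random.randn(500, 64)   # 10 s of audio features at 50 Hz (assumed)
video = np.random.randn(250, 32)   # same 10 s of video features at 25 Hz
video_aligned = align_visual_to_audio(video, v_hz=25, n_audio=500, a_hz=50)
fused = np.concatenate([audio, video_aligned], axis=1)
print(fused.shape)                 # (500, 96)
```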
arXiv Detail & Related papers (2025-11-25T03:38:05Z)
- MDD-Net: Multimodal Depression Detection through Mutual Transformer [1.18749525824656]
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. A Multimodal Depression Detection Network (MDD-Net) is proposed in this work, where mutual transformers are exploited to extract and fuse multimodal features for efficient depression detection. The developed network surpasses the state-of-the-art by up to 17.37% in F1-score.
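A minimal sketch of one plausible reading of the "mutual transformers" idea, where each modality cross-attends to the other before fusion; the dimensions, mean-pooling, and two-modality setup are assumptions, not MDD-Net's reported design.

```python
# Hedged sketch of mutual cross-attention fusion between two modalities.
import torch
import torch.nn as nn

class MutualFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)        # binary depression label

    def forward(self, audio, text):
        a, _ = self.a2t(audio, text, text)       # audio queries attend to text
        t, _ = self.t2a(text, audio, audio)      # text queries attend to audio
        pooled = torch.cat([a.mean(1), t.mean(1)], dim=-1)
        return self.head(pooled)

model = MutualFusion()
logits = model(torch.randn(8, 100, 128), torch.randn(8, 60, 128))
print(logits.shape)  # (8, 2)
```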
arXiv Detail & Related papers (2025-08-11T15:32:56Z)
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness. We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
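For context, dense retrieval with a unified embedder reduces to nearest-neighbour search in a single vector space; the toy sketch below uses random stand-in vectors, not OmniEmbed outputs.

```python
# Toy dense-retrieval scoring: rank documents by cosine similarity to a
# query embedding. The embeddings here are random placeholders.
import numpy as np

rng = np.random.default_rng(2)
docs = rng.normal(size=(1000, 256))    # pretend text/image/audio embeddings
query = rng.normal(size=(256,))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)
scores = docs @ query                  # cosine similarity
print(np.argsort(-scores)[:5])         # indices of the top-5 documents
```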
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
- Generating Medically-Informed Explanations for Depression Detection using LLMs [1.325953054381901]
Early detection of depression from social media data offers a valuable opportunity for timely intervention. We propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that combines the power of large language models with the crucial aspect of explainability.
arXiv Detail & Related papers (2025-03-18T19:23:22Z)
- Context-Aware Deep Learning for Multi Modal Depression Detection [41.02897689721331]
We focus on automated approaches to detect depression from clinical interviews using multi-modal machine learning (ML). We propose a novel method that incorporates: (1) a pre-trained Transformer combined with data augmentation based on topic modelling for textual data; and (2) a deep 1D convolutional neural network (CNN) for acoustic feature modeling. Our deep 1D CNN and Transformer models achieved state-of-the-art performance for the audio and text modalities, respectively.
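The audio branch named here, a deep 1D CNN over frame-level acoustic features, might look roughly like the following; the layer sizes are illustrative assumptions rather than the paper's reported architecture.

```python
# Hedged sketch of a deep 1D CNN over frame-level acoustic features.
import torch
import torch.nn as nn

audio_cnn = nn.Sequential(
    nn.Conv1d(40, 64, kernel_size=5, padding=2),  # 40 MFCC-like channels
    nn.ReLU(),
    nn.MaxPool1d(2),                              # halve the time axis
    nn.Conv1d(64, 128, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # pool over time
    nn.Flatten(),
    nn.Linear(128, 2),                            # binary label
)
x = torch.randn(8, 40, 300)  # batch of 8 clips, 300 frames each
print(audio_cnn(x).shape)    # (8, 2)
```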
arXiv Detail & Related papers (2024-12-26T13:19:26Z)
- Automated Ensemble Multimodal Machine Learning for Healthcare [52.500923923797835]
We introduce a multimodal framework, AutoPrognosis-M, that enables the integration of structured clinical (tabular) data and medical imaging using automated machine learning.
AutoPrognosis-M incorporates 17 imaging models, including convolutional neural networks and vision transformers, and three distinct multimodal fusion strategies.
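One of the simplest fusion strategies such a framework can include is late fusion, sketched below with stand-in per-modality models; the models, features, and equal weighting are illustrative assumptions, not AutoPrognosis-M's actual search space.

```python
# Hedged late-fusion sketch: one model per modality, probabilities averaged.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
tabular = rng.normal(size=(100, 10))   # structured clinical features
imaging = rng.normal(size=(100, 32))   # pretend image-embedding features
y = rng.integers(0, 2, size=100)

m_tab = LogisticRegression(max_iter=1000).fit(tabular, y)
m_img = LogisticRegression(max_iter=1000).fit(imaging, y)
# Late fusion: average the per-modality predicted probabilities.
p = 0.5 * m_tab.predict_proba(tabular) + 0.5 * m_img.predict_proba(imaging)
print(p[:3])
```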
arXiv Detail & Related papers (2024-07-25T17:46:38Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
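A hedged sketch of the missing-modality prompt idea described above: when a modality is absent, a small set of learnable prompt vectors stands in for it at the encoder input, so only the prompts need training. The prompt length, the single shared encoder, and all dimensions are assumptions.

```python
# Hedged sketch of learnable prompts substituting for a missing modality.
import torch
import torch.nn as nn

class PromptedFusion(nn.Module):
    def __init__(self, dim=128, prompt_len=8):
        super().__init__()
        # One learnable "missing-modality" prompt per modality.
        self.missing_text = nn.Parameter(torch.randn(prompt_len, dim))
        self.missing_audio = nn.Parameter(torch.randn(prompt_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 2)

    def forward(self, text=None, audio=None):
        b = (text if text is not None else audio).shape[0]
        t = text if text is not None else self.missing_text.expand(b, -1, -1)
        a = audio if audio is not None else self.missing_audio.expand(b, -1, -1)
        h = self.encoder(torch.cat([t, a], dim=1))
        return self.head(h.mean(1))

model = PromptedFusion()
print(model(text=torch.randn(4, 20, 128), audio=None).shape)  # (4, 2)
```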
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Multi-Modal Perceiver Language Model for Outcome Prediction in Emergency Department [0.03088120935391119]
We are interested in outcome prediction and patient triage in hospital emergency department based on text information in chief complaints and vital signs recorded at triage.
We adapt Perceiver - a modality-agnostic transformer-based model that has shown promising results in several applications.
In the experimental analysis, we show that multi-modality improves the prediction performance compared with models trained solely on text or vital signs.
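The Perceiver idea, as adapted here, can be caricatured as a fixed latent array cross-attending over a mixed sequence of text-token and vital-sign embeddings, making the model agnostic to each element's modality; the sizes below are illustrative assumptions.

```python
# Hedged sketch of a Perceiver-style latent array attending over a mixed
# sequence of text-token and vital-sign embeddings.
import torch
import torch.nn as nn

dim, n_latents = 64, 16
latents = nn.Parameter(torch.randn(n_latents, dim))
cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
head = nn.Linear(dim, 2)                 # e.g. admit vs. discharge outcome

text_tokens = torch.randn(8, 50, dim)    # embedded chief-complaint tokens
vital_tokens = torch.randn(8, 6, dim)    # embedded triage vital signs
inputs = torch.cat([text_tokens, vital_tokens], dim=1)  # one mixed sequence

q = latents.unsqueeze(0).expand(8, -1, -1)
z, _ = cross(q, inputs, inputs)          # latents attend over all inputs
print(head(z.mean(1)).shape)             # (8, 2)
```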
arXiv Detail & Related papers (2023-04-03T06:32:00Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and lingual encoders trained multimodally are more brain-like compared with unimodal ones.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- Multimodal Depression Severity Prediction from medical bio-markers using Machine Learning Tools and Technologies [0.0]
Depression has been one of the leading mental-health illnesses across the world. The use of behavioural cues to automate depression diagnosis and stage prediction has increased in recent years. The absence of labelled behavioural datasets and the vast number of possible variations prove to be a major challenge in accomplishing the task.
arXiv Detail & Related papers (2020-09-11T20:44:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.