Exploring Machine Learning and Language Models for Multimodal Depression Detection
- URL: http://arxiv.org/abs/2508.20805v1
- Date: Thu, 28 Aug 2025 14:07:07 GMT
- Title: Exploring Machine Learning and Language Models for Multimodal Depression Detection
- Authors: Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng, Xiaoxiao Miao
- Abstract summary: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities.
- Score: 8.357574678947245
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.
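As a concrete illustration (not the authors' released code), the sketch below shows the kind of early-fusion XGBoost baseline the abstract describes: per-modality feature vectors are concatenated and passed to a gradient-boosted classifier. The feature dimensions, hyperparameters, and synthetic data are illustrative assumptions.

```python
# Minimal sketch of an early-fusion XGBoost baseline for multimodal
# depression detection. All sizes below are illustrative assumptions.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200                              # hypothetical number of sessions
audio = rng.normal(size=(n, 88))     # e.g. eGeMAPS-style acoustic features
video = rng.normal(size=(n, 49))     # e.g. facial action-unit statistics
text = rng.normal(size=(n, 384))     # e.g. sentence embeddings of transcripts
y = rng.integers(0, 2, size=n)       # binary depression label

X = np.concatenate([audio, video, text], axis=1)  # early fusion
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```

The transformer and LLM variants the abstract compares would replace the fixed concatenated features with learned encoders, trading the interpretability of tree models for representational power.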
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes the scaling asymmetry between modalities by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
- It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models [23.966623683606425]
Depression is one of the most prevalent mental health disorders globally. We propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level.
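As a rough illustration of the timestamp-level alignment this summary mentions, the sketch below interpolates frame-rate visual features onto the audio feature timeline before fusion; the sampling rates and dimensions are assumptions, not the paper's values.

```python
# Hedged sketch: resample per-frame visual features onto audio timestamps
# so the two streams can be fused position-by-position.
import numpy as np

def align_visual_to_audio(visual, v_hz, n_audio, a_hz):
    """Interpolate each visual feature dimension at the audio timestamps."""
    t_audio = np.arange(n_audio) / a_hz
    t_video = np.arange(len(visual)) / v_hz
    return np.stack([np.interp(t_audio, t_video, visual[:, d])
                     for d in range(visual.shape[1])], axis=1)

audio = np.random.randn(500, 64)   # 10 s of audio features at 50 Hz (assumed)
video = np.random.randn(250, 32)   # same 10 s of video features at 25 Hz
video_aligned = align_visual_to_audio(video, v_hz=25, n_audio=500, a_hz=50)
fused = np.concatenate([audio, video_aligned], axis=1)
print(fused.shape)                 # (500, 96)
```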
arXiv Detail & Related papers (2025-11-25T03:38:05Z)
- MDD-Net: Multimodal Depression Detection through Mutual Transformer [1.18749525824656]
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. A Multimodal Depression Detection Network (MDD-Net) is proposed in this work, where mutual transformers are exploited to extract and fuse multimodal features for efficient depression detection. The developed network surpasses the state-of-the-art by up to 17.37% in F1-score.
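A minimal sketch of one plausible reading of the "mutual transformers" idea, where each modality cross-attends to the other before fusion; the dimensions, mean-pooling, and two-modality setup are assumptions, not MDD-Net's reported design.

```python
# Hedged sketch of mutual cross-attention fusion between two modalities.
import torch
import torch.nn as nn

class MutualFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)        # binary depression label

    def forward(self, audio, text):
        a, _ = self.a2t(audio, text, text)       # audio queries attend to text
        t, _ = self.t2a(text, audio, audio)      # text queries attend to audio
        pooled = torch.cat([a.mean(1), t.mean(1)], dim=-1)
        return self.head(pooled)

model = MutualFusion()
logits = model(torch.randn(8, 100, 128), torch.randn(8, 60, 128))
print(logits.shape)  # (8, 2)
```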
arXiv Detail & Related papers (2025-08-11T15:32:56Z)
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness. We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
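For context, dense retrieval with a unified embedder reduces to nearest-neighbour search in a single vector space; the toy sketch below uses random stand-in vectors, not OmniEmbed outputs.

```python
# Toy dense-retrieval scoring: rank documents by cosine similarity to a
# query embedding. The embeddings here are random placeholders.
import numpy as np

rng = np.random.default_rng(2)
docs = rng.normal(size=(1000, 256))    # pretend text/image/audio embeddings
query = rng.normal(size=(256,))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)
scores = docs @ query                  # cosine similarity
print(np.argsort(-scores)[:5])         # indices of the top-5 documents
```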
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
- Generating Medically-Informed Explanations for Depression Detection using LLMs [1.325953054381901]
Early detection of depression from social media data offers a valuable opportunity for timely intervention. We propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that combines the power of large language models with the crucial aspect of explainability.
arXiv Detail & Related papers (2025-03-18T19:23:22Z)
- Context-Aware Deep Learning for Multi Modal Depression Detection [41.02897689721331]
We focus on automated approaches to detect depression from clinical interviews using multi-modal machine learning (ML). We propose a novel method that incorporates: (1) a pre-trained Transformer combined with data augmentation based on topic modelling for textual data; and (2) a deep 1D convolutional neural network (CNN) for acoustic feature modeling. Our deep 1D CNN and Transformer models achieved state-of-the-art performance for the audio and text modalities, respectively.
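The audio branch named here, a deep 1D CNN over frame-level acoustic features, might look roughly like the following; the layer sizes are illustrative assumptions rather than the paper's reported architecture.

```python
# Hedged sketch of a deep 1D CNN over frame-level acoustic features.
import torch
import torch.nn as nn

audio_cnn = nn.Sequential(
    nn.Conv1d(40, 64, kernel_size=5, padding=2),  # 40 MFCC-like channels
    nn.ReLU(),
    nn.MaxPool1d(2),                              # halve the time axis
    nn.Conv1d(64, 128, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # pool over time
    nn.Flatten(),
    nn.Linear(128, 2),                            # binary label
)
x = torch.randn(8, 40, 300)  # batch of 8 clips, 300 frames each
print(audio_cnn(x).shape)    # (8, 2)
```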
arXiv Detail & Related papers (2024-12-26T13:19:26Z)
- Automated Ensemble Multimodal Machine Learning for Healthcare [52.500923923797835]
We introduce a multimodal framework, AutoPrognosis-M, that enables the integration of structured clinical (tabular) data and medical imaging using automated machine learning.
AutoPrognosis-M incorporates 17 imaging models, including convolutional neural networks and vision transformers, and three distinct multimodal fusion strategies.
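One of the simplest fusion strategies such a framework can include is late fusion, sketched below with stand-in per-modality models; the models, features, and equal weighting are illustrative assumptions, not AutoPrognosis-M's actual search space.

```python
# Hedged late-fusion sketch: one model per modality, probabilities averaged.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
tabular = rng.normal(size=(100, 10))   # structured clinical features
imaging = rng.normal(size=(100, 32))   # pretend image-embedding features
y = rng.integers(0, 2, size=100)

m_tab = LogisticRegression(max_iter=1000).fit(tabular, y)
m_img = LogisticRegression(max_iter=1000).fit(imaging, y)
# Late fusion: average the per-modality predicted probabilities.
p = 0.5 * m_tab.predict_proba(tabular) + 0.5 * m_img.predict_proba(imaging)
print(p[:3])
```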
arXiv Detail & Related papers (2024-07-25T17:46:38Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
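A hedged sketch of the missing-modality prompt idea described above: when a modality is absent, a small set of learnable prompt vectors stands in for it at the encoder input, so only the prompts need training. The prompt length, the single shared encoder, and all dimensions are assumptions.

```python
# Hedged sketch of learnable prompts substituting for a missing modality.
import torch
import torch.nn as nn

class PromptedFusion(nn.Module):
    def __init__(self, dim=128, prompt_len=8):
        super().__init__()
        # One learnable "missing-modality" prompt per modality.
        self.missing_text = nn.Parameter(torch.randn(prompt_len, dim))
        self.missing_audio = nn.Parameter(torch.randn(prompt_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 2)

    def forward(self, text=None, audio=None):
        b = (text if text is not None else audio).shape[0]
        t = text if text is not None else self.missing_text.expand(b, -1, -1)
        a = audio if audio is not None else self.missing_audio.expand(b, -1, -1)
        h = self.encoder(torch.cat([t, a], dim=1))
        return self.head(h.mean(1))

model = PromptedFusion()
print(model(text=torch.randn(4, 20, 128), audio=None).shape)  # (4, 2)
```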
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Multi-Modal Perceiver Language Model for Outcome Prediction in Emergency Department [0.03088120935391119]
We are interested in outcome prediction and patient triage in hospital emergency department based on text information in chief complaints and vital signs recorded at triage.
We adapt Perceiver - a modality-agnostic transformer-based model that has shown promising results in several applications.
In the experimental analysis, we show that multi-modality improves the prediction performance compared with models trained solely on text or vital signs.
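The Perceiver idea, as adapted here, can be caricatured as a fixed latent array cross-attending over a mixed sequence of text-token and vital-sign embeddings, making the model agnostic to each element's modality; the sizes below are illustrative assumptions.

```python
# Hedged sketch of a Perceiver-style latent array attending over a mixed
# sequence of text-token and vital-sign embeddings.
import torch
import torch.nn as nn

dim, n_latents = 64, 16
latents = nn.Parameter(torch.randn(n_latents, dim))
cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
head = nn.Linear(dim, 2)                 # e.g. admit vs. discharge outcome

text_tokens = torch.randn(8, 50, dim)    # embedded chief-complaint tokens
vital_tokens = torch.randn(8, 6, dim)    # embedded triage vital signs
inputs = torch.cat([text_tokens, vital_tokens], dim=1)  # one mixed sequence

q = latents.unsqueeze(0).expand(8, -1, -1)
z, _ = cross(q, inputs, inputs)          # latents attend over all inputs
print(head(z.mean(1)).shape)             # (8, 2)
```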
arXiv Detail & Related papers (2023-04-03T06:32:00Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and lingual encoders trained multimodally are more brain-like compared with unimodal ones.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- Multimodal Depression Severity Prediction from medical bio-markers using Machine Learning Tools and Technologies [0.0]
Depression has been one of the leading mental-health illnesses across the world. The use of behavioural cues to automate depression diagnosis and stage prediction has increased in recent years. The absence of labelled behavioural datasets and the vast number of possible variations prove to be a major challenge in accomplishing the task.
arXiv Detail & Related papers (2020-09-11T20:44:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.