Related papers: MMFformer: Multimodal Fusion Transformer Network for Depression Detection

URL: http://arxiv.org/abs/2508.06701v1
Date: Fri, 08 Aug 2025 21:03:29 GMT
Title: MMFformer: Multimodal Fusion Transformer Network for Depression Detection
Authors: Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Abstract summary: Depression is a serious mental illness that significantly affects an individual's well-being and quality of life. This paper introduces a multimodal detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The proposed network is assessed on two large-scale depression detection datasets.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Depression is a serious mental illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult because it relies primarily on subjective evaluations during clinical interviews. Hence, early diagnosis of depression from social network content has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. A transformer network with residual connections captures spatial features from videos, and a transformer encoder is used to model important temporal dynamics in audio. Moreover, the fusion architecture combines the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is evaluated on two large-scale depression detection datasets, and the results show that it surpasses existing state-of-the-art approaches, improving the F1-score by 13.92% on the D-Vlog dataset and by 7.74% on the LMVD dataset. The code is publicly available at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
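The abstract contrasts intermediate (feature-level) fusion with late (decision-level) fusion. The sketch below illustrates the general difference between the two strategies for one video embedding and one audio embedding, using plain Python. It is a hypothetical, minimal illustration, not MMFformer's implementation: the linear classifier head, the function names (`linear_score`, `intermediate_fusion`, `late_fusion`), the weights, and the fixed averaging weight are all assumptions chosen for clarity. The repository linked above is the authoritative source for the actual architecture.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear classifier head returning P(depressed | features)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(logit)


def intermediate_fusion(video: Sequence[float], audio: Sequence[float],
                        weights: Sequence[float], bias: float) -> float:
    """Feature-level fusion: concatenate modality embeddings, then classify jointly.

    A single classifier sees both modalities at once, so it can weigh
    cross-modal combinations of features.
    """
    fused = list(video) + list(audio)
    return linear_score(fused, weights, bias)


def late_fusion(video_prob: float, audio_prob: float, video_weight: float = 0.5) -> float:
    """Decision-level fusion: weighted average of per-modality probabilities.

    Each modality is classified independently; only the final decisions
    are combined (assumed fixed weighting, not a learned one).
    """
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    return video_weight * video_prob + (1.0 - video_weight) * audio_prob


if __name__ == "__main__":
    video = [0.2, -0.1]  # toy video embedding (illustrative values)
    audio = [0.5, 0.3]   # toy audio embedding (illustrative values)
    p_mid = intermediate_fusion(video, audio, weights=[1.0] * 4, bias=0.0)
    p_video = linear_score(video, [1.0, 1.0], 0.0)
    p_audio = linear_score(audio, [1.0, 1.0], 0.0)
    p_late = late_fusion(p_video, p_audio)
    print(f"intermediate: {p_mid:.4f}, late: {p_late:.4f}")
```

In this toy setup, intermediate fusion lets one classifier learn interactions between video and audio features, while late fusion keeps the modalities independent until the final decision. A full system would replace the linear heads with learned encoders (for example, transformer blocks) and would typically learn the fusion weights rather than fix them.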

Related papers

Towards Stable Cross-Domain Depression Recognition under Missing Modalities
Depression poses serious public health risks, including suicide, underscoring the urgency of timely and scalable screening. We propose a unified framework for Stable Cross-Domain Depression Recognition based on a Multimodal Large Language Model (SCD-MLLM). The framework supports the integration and processing of heterogeneous depression-related data collected from varied sources.
arXiv (2025-12-06T14:19:57Z)
MDD-Net: Multimodal Depression Detection through Mutual Transformer
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. This work proposes a Multimodal Depression Detection Network (MDD-Net), in which mutual transformers are exploited to extract and fuse multimodal features for efficient depression detection. The developed network surpasses the state of the art by up to 17.37% in F1-score.
arXiv (2025-08-11T15:32:56Z)
Enhancing Depression Detection via Question-wise Modality Fusion
Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. We propose a novel Question-wise Modality Fusion framework trained with a novel Imbalanced Ordinal Log-Loss function.
arXiv (2025-03-26T12:34:34Z)
CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis
We introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations, which are refined using slot attention to generate temporal semantic embeddings.
arXiv (2024-11-01T15:54:07Z)
A BERT-Based Summarization approach for depression detection
Depression is a globally prevalent mental disorder with potentially severe repercussions if not addressed. Machine learning and artificial intelligence can autonomously detect depression indicators from diverse data sources. Our study proposes text summarization as a preprocessing technique to reduce the length and complexity of input texts.
arXiv (2024-09-13T02:14:34Z)
A Depression Detection Method Based on Multi-Modal Feature Fusion Using Cross-Attention
Depression affects approximately 3.8% of the global population. Over 75% of individuals in low- and middle-income countries remain untreated. This paper introduces a novel method for detecting depression based on multi-modal feature fusion utilizing cross-attention.
arXiv (2024-07-02T13:13:35Z)
Attention-Based Acoustic Feature Fusion Network for Depression Detection
We present the Attention-Based Acoustic Feature Fusion Network (ABAFnet) for depression detection. ABAFnet combines four different acoustic features into a comprehensive deep learning model, thereby effectively integrating and blending multi-tiered features. We present a novel weight adjustment module for late fusion that boosts performance by effectively synthesizing these features.
arXiv (2023-08-24T00:31:51Z)
Brain Imaging-to-Graph Generation using Adversarial Hierarchical Diffusion Models for MCI Causality Analysis
A brain imaging-to-graph generation (BIGG) framework is proposed to map functional magnetic resonance imaging (fMRI) into effective connectivity for mild cognitive impairment analysis. The hierarchical transformers in the generator are designed to estimate the noise at multiple scales. Evaluations on the ADNI dataset demonstrate the feasibility and efficacy of the proposed model.
arXiv (2023-05-18T06:54:56Z)
Fader Networks for domain adaptation on fMRI: ABIDE-II study
We use 3D convolutional autoencoders to build a domain-irrelevant latent-space image representation and show that this method outperforms existing approaches on ABIDE data.
arXiv (2020-10-14T16:50:50Z)
Multimodal Depression Severity Prediction from medical bio-markers using Machine Learning Tools and Technologies
Depression has been a leading cause of mental-health illnesses across the world. The use of behavioural cues to automate depression diagnosis and stage prediction has increased in recent years. The absence of labelled behavioural datasets and the vast number of possible variations are major challenges in accomplishing the task.
arXiv (2020-09-11T20:44:28Z)
Context-Aware Refinement Network Incorporating Structural Connectivity Prior for Brain Midline Delineation
We propose a context-aware refinement network (CAR-Net) to refine and integrate the feature pyramid representation generated by the UNet. To preserve the structural connectivity of the brain midline, we introduce a novel connectivity regular loss. The proposed method requires fewer parameters and outperforms three state-of-the-art methods in terms of four evaluation metrics.
arXiv (2020-07-10T14:01:20Z)
M2Net: Multi-modal Multi-channel Network for Overall Survival Time Prediction of Brain Tumor Patients
Early and accurate prediction of overall survival (OS) time can support better treatment planning for brain tumor patients. Existing prediction methods rely on radiomic features at the local lesion area of a magnetic resonance (MR) volume. We propose an end-to-end OS time prediction model, namely the Multi-modal Multi-channel Network (M2Net).
arXiv (2020-06-01T05:21:37Z)