Retrieval-Augmented Multimodal Depression Detection
- URL: http://arxiv.org/abs/2511.01892v1
- Date: Wed, 29 Oct 2025 06:51:54 GMT
- Title: Retrieval-Augmented Multimodal Depression Detection
- Authors: Ruibo Hou, Shiyu Teng, Jiaqing Liu, Shurong Chai, Yinhao Li, Lanfen Lin, Yen-Wei Chen
- Abstract summary: We propose a novel Retrieval-Augmented Generation (RAG) framework for depression detection. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset. Our approach achieves state-of-the-art performance with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.
- Score: 18.36451774538809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal deep learning has shown promise in depression detection by integrating text, audio, and video signals. Recent work leverages sentiment analysis to enhance emotional understanding, yet suffers from high computational cost, domain mismatch, and static knowledge limitations. To address these issues, we propose a novel Retrieval-Augmented Generation (RAG) framework. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset and uses a Large Language Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt enriches emotional representation and improves interpretability. Experiments on the AVEC 2019 dataset show our approach achieves state-of-the-art performance with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.
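To make the pipeline concrete, here is a minimal sketch of the retrieve-then-prompt loop the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the MiniLM encoder, the three-sentence corpus, and `call_llm` are stand-ins for whatever retriever, sentiment dataset, and LLM the paper actually uses.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in retriever encoder (the paper's encoder choice may differ).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy stand-in for the sentiment dataset the method retrieves from.
sentiment_corpus = [
    "I feel hopeless and tired all the time.",
    "Nothing excites me anymore; I just want to sleep.",
    "Today was great, I felt energetic and happy.",
]
corpus_emb = encoder.encode(sentiment_corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus sentences most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q  # cosine similarity (embeddings are normalized)
    return [sentiment_corpus[i] for i in np.argsort(-scores)[:k]]

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with any chat-completion client."""
    return "Emotion Prompt: the speaker conveys fatigue and loss of interest."

def build_emotion_prompt(utterance: str) -> str:
    """Retrieve emotional evidence, then ask the LLM to write the prompt."""
    evidence = "\n".join(retrieve(utterance))
    return call_llm(
        "Summarize the emotional state suggested by this evidence, as "
        "context for assessing depression severity.\n"
        f"Utterance: {utterance}\nRetrieved evidence:\n{evidence}"
    )
```

Per the abstract, the generated Emotion Prompt is then encoded and fused with the text, audio, and video branches as an auxiliary modality.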
Related papers
- Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs [16.56946059161466]
We propose borrowing knowledge from a pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples.
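The phrase "dynamically constructing negative and positive pairs" suggests a contrastive objective over image-text embeddings. As a reference point only, the sketch below shows the generic symmetric InfoNCE loss such methods build on; the paper's actual handling of noisy samples is not reproduced here.

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of image-text pairs.

    Each (i, i) pairing is a positive; every other pairing in the batch
    acts as a negative. Generic recipe, not this paper's exact loss.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```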
arXiv Detail & Related papers (2025-11-21T10:06:32Z)
- Emotion Transfer with Enhanced Prototype for Unseen Emotion Recognition in Conversation [64.70874527264543]
We introduce the Unseen Emotion Recognition in Conversation (UERC) task for the first time. We propose ProEmoTrans, a prototype-based emotion transfer framework. ProEmoTrans shows promise but still faces key challenges.
arXiv Detail & Related papers (2025-08-27T03:16:16Z)
- Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs [47.325269852330884]
We develop a strategy to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. We introduce a unified framework combining reasoning-augmented data supervision, a dual-encoder architecture, and task-alternating training. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
arXiv Detail & Related papers (2025-06-07T14:52:58Z) - Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge. We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z) - Emotion-Aware Embedding Fusion in LLMs (Flan-T5, LLAMA 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation [0.5454121013433086]
This study addresses the challenge of enhancing the emotional and contextual understanding of large language models (LLMs) in psychiatric applications. We introduce Emotion-Aware Embedding Fusion, a novel framework integrating hierarchical fusion and attention mechanisms. The system can be integrated into existing mental health platforms to generate personalized responses based on retrieved therapy session data.
arXiv Detail & Related papers (2024-10-02T08:01:05Z) - Unsupervised Extractive Summarization of Emotion Triggers [56.50078267340738]
We develop new unsupervised learning models that can jointly detect emotions and summarize their triggers.
Our best approach, entitled Emotion-Aware Pagerank, incorporates emotion information from external sources combined with a language understanding module.
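The name Emotion-Aware Pagerank suggests a personalized PageRank whose restart distribution favors emotionally salient sentences. A minimal sketch of that idea follows; the actual weighting and the language understanding module are the paper's own and are not reproduced here.

```python
import numpy as np

def emotion_pagerank(sim: np.ndarray, emotion: np.ndarray,
                     damping: float = 0.85, iters: int = 50) -> np.ndarray:
    """Personalized PageRank over a sentence-similarity graph.

    sim:     (n, n) nonnegative similarity matrix with nonzero row sums
    emotion: (n,) nonnegative emotion-salience scores (e.g. lexicon-based)
    """
    n = sim.shape[0]
    trans = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transitions
    teleport = emotion / emotion.sum()            # emotion-biased restart
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) * teleport + damping * (trans.T @ rank)
    return rank  # top-ranked sentences form the extractive trigger summary
```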
arXiv Detail & Related papers (2023-06-02T11:07:13Z)
- REDAffectiveLM: Leveraging Affect Enriched Embedding and Transformer-based Neural Language Model for Readers' Emotion Detection [3.6678641723285446]
We propose a novel approach for Readers' Emotion Detection from short-text documents using a deep learning model called REDAffectiveLM.
We leverage context-specific and affect-enriched representations by using a transformer-based pre-trained language model in tandem with an affect-enriched Bi-LSTM+Attention network.
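A generic form of the "PLM embeddings feeding an attention-pooled Bi-LSTM" pattern is sketched below; the dimensions, the number of emotion classes, and the affect-enrichment step are illustrative placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """Bi-LSTM with additive attention pooling over token embeddings."""

    def __init__(self, emb_dim: int = 768, hidden: int = 128, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (B, T, emb_dim) transformer outputs, assumed already
        # concatenated with affect-lexicon features upstream.
        h, _ = self.lstm(token_embs)                  # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time
        pooled = (weights * h).sum(dim=1)             # (B, 2*hidden)
        return self.head(pooled)                      # per-emotion logits
```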
arXiv Detail & Related papers (2023-01-21T19:28:25Z)
- Leveraging Sentiment Analysis Knowledge to Solve Emotion Detection Tasks [11.928873764689458]
We present a Transformer-based model with a Fusion of Adapter layers to improve emotion detection on a large-scale dataset.
We obtained state-of-the-art results for emotion recognition on CMU-MOSEI even while using only the textual modality.
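The adapter layers in question are presumably the standard bottleneck modules of Houlsby et al. (2019); a generic adapter looks like the following (sizes illustrative, and the fusion scheme the paper applies on top is not shown):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.

    Inserted after transformer sublayers so only the small adapter weights
    are trained per task while the backbone stays frozen.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection
```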
arXiv Detail & Related papers (2021-11-05T20:06:58Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
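The reformulation of classification as masked text prediction can be illustrated with an off-the-shelf masked LM. The snippet below uses vanilla BERT via Hugging Face, not the MEmoBERT checkpoint, and the prompt template and verbalizer are hypothetical:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Hypothetical verbalizer mapping predicted label words to emotion classes.
verbalizer = {"happy": "joy", "sad": "sadness", "angry": "anger", "scared": "fear"}

def classify(utterance: str) -> str:
    """Emotion classification recast as filling a masked slot in a template."""
    prompt = f"{utterance} I feel so [MASK]."
    for cand in fill(prompt, top_k=50):
        word = cand["token_str"].strip()
        if word in verbalizer:
            return verbalizer[word]
    return "neutral"  # fallback if no label word appears in the top-k

print(classify("My dog passed away yesterday."))  # expected: sadness
```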
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
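Extracting wav2vec 2.0 utterance features for such a classifier can be sketched with torchaudio's bundled model; the mean-pooling and the 4-class linear head below are illustrative stand-ins for the paper's cross-representation model.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 (torchaudio bundle) used as a frozen feature extractor.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # placeholder path, mono audio
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    layers, _ = model.extract_features(waveform)  # per-layer feature list
    pooled = layers[-1].mean(dim=1)               # (1, 768) utterance vector

# Illustrative 4-class head (angry/happy/neutral/sad, as in IEMOCAP setups).
head = torch.nn.Linear(pooled.size(-1), 4)
logits = head(pooled)
```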
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- Affective Image Content Analysis: Two Decades Review and New Perspectives [132.889649256384]
We comprehensively review the development of affective image content analysis (AICA) over the past two decades.
We focus on state-of-the-art methods for three main challenges -- the affective gap, perception subjectivity, and label noise and absence.
We also discuss remaining challenges and promising future research directions, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
arXiv Detail & Related papers (2021-06-30T15:20:56Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
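The reward here is presumably an emotion classifier scoring the synthesized speech. A bare-bones REINFORCE-style update of that shape is sketched below; `tts`, `emotion_clf`, and the sampling interface are all hypothetical stand-ins, not the i-ETTS components.

```python
import torch

def reinforce_step(tts, emotion_clf, optimizer, text, target_emotion: int):
    """One policy-gradient update rewarding emotion discriminability.

    Assumes `tts.sample(text)` performs stochastic synthesis and returns
    (audio, log_prob of the sampled output); both models are hypothetical.
    """
    audio, log_prob = tts.sample(text)
    with torch.no_grad():
        probs = emotion_clf(audio).softmax(dim=-1)
        reward = probs[target_emotion]   # classifier confidence as reward
    loss = -reward * log_prob            # REINFORCE surrogate objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```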
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.