Reading Between the Frames: Multi-Modal Depression Detection in Videos
from Non-Verbal Cues
- URL: http://arxiv.org/abs/2401.02746v1
- Date: Fri, 5 Jan 2024 10:47:42 GMT
- Title: Reading Between the Frames: Multi-Modal Depression Detection in Videos
from Non-Verbal Cues
- Authors: David Gimeno-Gómez, Ana-Maria Bucur, Adrian Cosma, Carlos-David Martínez-Hinarejos, Paolo Rosso
- Abstract summary: Depression, a prominent contributor to global disability, affects a substantial portion of the population.
Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content.
We propose a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos.
- Score: 11.942057763913208
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Depression, a prominent contributor to global disability, affects a
substantial portion of the population. Efforts to detect depression from social
media texts have been prevalent, yet only a few works explored depression
detection from user-generated video content. In this work, we address this
research gap by proposing a simple and flexible multi-modal temporal model
capable of discerning non-verbal depression cues from diverse modalities in
noisy, real-world videos. We show that, for in-the-wild videos, using
additional high-level non-verbal cues is crucial to achieving good performance,
and to this end we extract and process audio speech embeddings, face emotion
embeddings; face, body, and hand landmarks; and gaze and blinking information.
Through extensive experiments, we show that our model achieves state-of-the-art
results on three key benchmark datasets for depression detection from video by
a substantial margin. Our code is publicly available on GitHub.
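For orientation only, here is a minimal sketch of how a multi-modal temporal model of this kind could fuse pre-extracted per-frame streams (audio speech embeddings, face emotion embeddings, landmarks, gaze/blink signals). It is an assumption-laden illustration in PyTorch, not the authors' released implementation; the module names, feature dimensions, and the projection-sum-Transformer fusion are placeholders.

# Illustrative sketch only -- NOT the authors' released model.
# Assumes each modality is pre-extracted as a (batch, time, feat_dim) tensor.
import torch
import torch.nn as nn

class MultiModalTemporalClassifier(nn.Module):
    def __init__(self, modality_dims, d_model=256, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # One linear projection per modality so all streams share d_model.
        self.projections = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in modality_dims.items()}
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, inputs):
        # inputs: dict of name -> (batch, time, dim), frame-aligned across modalities.
        fused = sum(self.projections[name](x) for name, x in inputs.items())
        encoded = self.temporal_encoder(fused)   # (batch, time, d_model)
        clip_repr = encoded.mean(dim=1)          # temporal average pooling
        return self.classifier(clip_repr)        # (batch, n_classes)

# Example with hypothetical feature sizes (placeholders, not the paper's values).
model = MultiModalTemporalClassifier(
    {"audio": 1024, "face_emotion": 256, "landmarks": 210, "gaze_blink": 8}
)
batch = {
    "audio": torch.randn(2, 150, 1024),
    "face_emotion": torch.randn(2, 150, 256),
    "landmarks": torch.randn(2, 150, 210),
    "gaze_blink": torch.randn(2, 150, 8),
}
logits = model(batch)  # (2, 2)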
Related papers
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting
Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Depression detection in social media posts using affective and social norm features [84.12658971655253]
We propose a deep architecture for depression detection from social media posts.
We incorporate profanity and morality features of posts and words in our architecture using a late fusion scheme.
The inclusion of the proposed features yields state-of-the-art results in both settings.
arXiv Detail & Related papers (2023-03-24T21:26:27Z)
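The late-fusion scheme mentioned in the entry above can be illustrated with a generic sketch: each branch (pooled text embeddings and hand-crafted affective features such as profanity/morality scores) produces its own class probabilities, which are combined at the decision level. The encoder dimensions and the learnable fusion weight are assumptions, not the paper's configuration.

# Generic late-fusion sketch -- not the paper's model.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, affect_dim=16, n_classes=2):
        super().__init__()
        # Branch 1: scores from pooled text embeddings (e.g. a frozen encoder upstream).
        self.text_head = nn.Linear(text_dim, n_classes)
        # Branch 2: scores from hand-crafted affective features (hypothetical dimensions).
        self.affect_head = nn.Sequential(
            nn.Linear(affect_dim, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )
        # Learnable weight balancing the two decision-level scores.
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, text_emb, affect_feats):
        p_text = self.text_head(text_emb).softmax(dim=-1)
        p_affect = self.affect_head(affect_feats).softmax(dim=-1)
        w = torch.sigmoid(self.alpha)
        return w * p_text + (1 - w) * p_affect  # fused class probabilities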
- It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers [24.776445591293186]
We propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts.
Our model operates directly at the user-level, and we enrich it with the relative time between posts by using time2vec positional embeddings.
We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets.
arXiv Detail & Related papers (2023-01-13T09:40:19Z)
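For reference, time2vec positional embeddings as commonly defined combine one linear component with learnable periodic sine components; the sketch below follows that general definition, while the output size and the way the encoding would be added to post embeddings are assumptions rather than this paper's exact setup.

# time2vec sketch: t2v(t)[0] = w0*t + b0 (linear trend),
#                  t2v(t)[i] = sin(wi*t + bi) for i > 0 (periodic components).
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    def __init__(self, out_dim):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(out_dim - 1))
        self.b = nn.Parameter(torch.randn(out_dim - 1))

    def forward(self, t):
        # t: (batch, seq_len) relative times between posts (e.g. in hours).
        t = t.unsqueeze(-1)                           # (batch, seq_len, 1)
        linear = self.w0 * t + self.b0                # (batch, seq_len, 1)
        periodic = torch.sin(self.w * t + self.b)     # (batch, seq_len, out_dim-1)
        return torch.cat([linear, periodic], dim=-1)  # (batch, seq_len, out_dim)

# Hypothetical usage: add the time encoding to per-post embeddings before a transformer.
t2v = Time2Vec(out_dim=64)
post_times = torch.tensor([[0.0, 3.5, 27.0]])  # hours since first post
time_codes = t2v(post_times)                   # (1, 3, 64)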
- Semantic Similarity Models for Depression Severity Estimation [53.72188878602294]
This paper presents an efficient semantic pipeline to study depression severity in individuals based on their social media writings.
We use test user sentences for producing semantic rankings over an index of representative training sentences corresponding to depressive symptoms and severity levels.
We evaluate our methods on two Reddit-based benchmarks, achieving a 30% improvement over the state of the art in measuring depression severity.
arXiv Detail & Related papers (2022-11-14T18:47:26Z)
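The ranking step described in the entry above can be illustrated with a generic sentence-embedding pipeline: test sentences are scored by cosine similarity against an index of symptom-annotated sentences. The embedding model and the toy index below are placeholders, not the paper's actual configuration.

# Generic semantic-ranking sketch (not the paper's pipeline); model choice is arbitrary.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical index: representative sentences tagged with a symptom label.
index_sentences = [
    ("I can't sleep at night anymore", "insomnia"),
    ("Nothing I do feels enjoyable", "anhedonia"),
]
index_emb = model.encode([s for s, _ in index_sentences], convert_to_tensor=True)

def rank_symptoms(user_sentence, top_k=2):
    query_emb = model.encode(user_sentence, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, index_emb)[0]  # cosine similarity to each indexed sentence
    best = scores.topk(k=min(top_k, len(index_sentences)))
    return [(index_sentences[i][1], float(scores[i])) for i in best.indices]

print(rank_symptoms("I have lost interest in everything I used to like"))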
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
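A minimal sketch of the kind of contrastive objective described above: an InfoNCE-style loss that pulls together face-motion and audio embeddings from the same identity/segment and pushes apart mismatched pairs. The encoders, pairing strategy, and temperature are assumptions, not the paper's exact formulation.

# InfoNCE-style contrastive loss sketch -- illustrative, not the paper's exact objective.
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, audio_emb, temperature=0.07):
    # face_emb, audio_emb: (batch, dim); row i of each comes from the same identity/segment.
    face = F.normalize(face_emb, dim=-1)
    audio = F.normalize(audio_emb, dim=-1)
    logits = face @ audio.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(face.size(0), device=face.device)
    # Matching pairs (the diagonal) are positives; all other pairs are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))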
- A Psychologically Informed Part-of-Speech Analysis of Depression in Social Media [1.7188280334580193]
We use the depression dataset from the Early Risk Prediction on the Internet Workshop (eRisk) 2018.
Our results reveal statistically significant differences between the depressed and non-depressed individuals.
Our work provides insights into the way depressed individuals express themselves on social media platforms.
arXiv Detail & Related papers (2021-07-31T16:23:22Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Multimodal Depression Severity Prediction from medical bio-markers using Machine Learning Tools and Technologies [0.0]
Depression has been one of the leading mental-health illnesses across the world.
The use of behavioural cues to automate depression diagnosis and stage prediction has increased in recent years.
The absence of labelled behavioural datasets and the vast number of possible variations remain major challenges for this task.
arXiv Detail & Related papers (2020-09-11T20:44:28Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
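As a rough illustration of what a view-temporal attention block could look like, the sketch below attends across camera views within each frame and then across time; this generic design and its dimensions are assumptions for illustration, not the mechanism proposed in the paper.

# Generic view-temporal attention sketch -- not the paper's exact mechanism.
import torch
import torch.nn as nn

class ViewTemporalAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, views, d_model) features from multiple camera views.
        b, t, v, d = x.shape
        x_v = x.reshape(b * t, v, d)
        x_v, _ = self.view_attn(x_v, x_v, x_v)   # attend across views within each frame
        x = x_v.reshape(b, t, v, d).mean(dim=2)  # pool views -> (batch, time, d_model)
        x, _ = self.time_attn(x, x, x)           # attend across time
        return x                                 # (batch, time, d_model)

out = ViewTemporalAttention()(torch.randn(2, 30, 3, 128))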