Related papers: EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

URL: http://arxiv.org/abs/2405.00574v1
Date: Wed, 1 May 2024 15:25:54 GMT
Title: EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model
Authors: Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, Heikki Kälviäinen,
Abstract summary: We construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD. We also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state.
Score: 22.292581935835678
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have been more focused on short sequential video emotion analysis while overlooking long sequential video. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address the aforementioned limitations, in this paper, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD by collecting and processing the sequences of athletes' post-match interviews. In addition to providing annotations of the overall emotional state of each video, we also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate the Multimodal Large Language Models (MLLMs) with de-identification signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, even better performance than the supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.

Related papers

AI with Emotions: Exploring Emotional Expressions in Large Language Models [0.0]
Large Language Models (LLMs) play role-play as agents answering questions with specified emotional states. Russell's Circumplex model characterizes emotions along the sleepy-activated (arousal) and pleasure-displeasure (valence) axes. evaluation showed that the emotional states of the generated answers were consistent with the specifications.
arXiv Detail & Related papers (2025-04-20T18:49:25Z)
Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis [6.387263468033964]
We introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations. In addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM. Our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
arXiv Detail & Related papers (2025-01-16T12:27:05Z)
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation [39.30784838378127]
The generation of talking avatars has achieved significant advancements in precise audio synchronization. Current methods face fundamental challenges, including the lack of frameworks for modeling single basic emotional expressions. We propose the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states. In conjunction with the DH-FaceEmoVid-150 dataset, we demonstrate that the MoEE framework excels in generating complex emotional expressions and nuanced facial details.
arXiv Detail & Related papers (2025-01-03T13:43:21Z)
HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions. Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z)
Generative Emotion Cause Explanation in Multimodal Conversations [22.476961519338474]
We introduce a task-textbfMultimodal Emotion Cause Explanation in Conversation (MECEC).<n>This task aims to generate a summary based on the multimodal context of conversations, clearly and intuitively describing the reasons that trigger a given emotion.<n>A novel approach, FAME-Net, is proposed, that harnesses the power of Large Language Models (LLMs) to analyze visual data and accurately interpret the emotions conveyed through facial expressions in videos.
arXiv Detail & Related papers (2024-11-01T09:16:30Z)
AER-LLM: Ambiguity-aware Emotion Recognition Leveraging Large Language Models [18.482881562645264]
This study is the first to explore the potential of Large Language Models (LLMs) in recognizing ambiguous emotions. We design zero-shot and few-shot prompting and incorporate past dialogue as context information for ambiguous emotion recognition.
arXiv Detail & Related papers (2024-09-26T23:25:21Z)
EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation. It enables fine-grained control of emotion categories and intensities. It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z)
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks. But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. EmoLLM is a novel model for multimodal emotional understanding, incorporating with two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
Think out Loud: Emotion Deducing Explanation in Dialogues [57.90554323226896]
We propose a new task "Emotion Deducing Explanation in Dialogues" (EDEN) EDEN recognizes emotion and causes in an explicitly thinking way. It can help Large Language Models (LLMs) achieve better recognition of emotions and causes.
arXiv Detail & Related papers (2024-06-07T08:58:29Z)
Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [128.70585652795637]
TEL presents three unique challenges compared to temporal action localization. The emotions have extremely varied temporal dynamics. The fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)
Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction. To the best of our knowledge, it is the first time to introduce stimuli selection process into VEA in an end-to-end network. Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z)
Emotion Recognition from Multiple Modalities: Fundamentals and Methodologies [106.62835060095532]
We discuss several key aspects of multi-modal emotion recognition (MER) We begin with a brief introduction on widely used emotion representation models and affective modalities. We then summarize existing emotion annotation strategies and corresponding computational tasks. Finally, we outline several real-world applications and discuss some future directions.
arXiv Detail & Related papers (2021-08-18T21:55:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.