MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues
- URL: http://arxiv.org/abs/2407.16552v2
- Date: Wed, 24 Jul 2024 01:09:36 GMT
- Title: MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues
- Authors: Liyun Zhang
- Abstract summary: We propose a time-sensitive Multimodal Large Language Model (MLLM) aimed at directing attention to the local facial micro-expression dynamics.
Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level timestamp-bound image features with local facial features of temporal dynamics of micro-expressions; and (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video then combining them.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal emotion recognition capabilities, integrating multimodal cues from visual, acoustic, and linguistic contexts in the video to recognize human emotional states. However, existing methods neither capture the local facial features that carry the temporal dynamics of micro-expressions nor leverage the contextual dependencies of utterance-aware temporal segments in the video, which limits their effectiveness. In this work, we propose MicroEmo, a time-sensitive MLLM aimed at directing attention to the local facial micro-expression dynamics and the contextual dependencies of utterance-aware video clips. Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level timestamp-bound image features with local facial features of temporal dynamics of micro-expressions; (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, then combining them. Preliminary qualitative experiments on a new Explainable Multimodal Emotion Recognition (EMER) task, which exploits multi-modal and multi-faceted clues to predict emotions in an open-vocabulary (OV) manner, indicate that MicroEmo is effective compared with the latest methods.
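The multi-scale token construction described in contribution (2) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `queries`, and the toy 2-dimensional frame features are all hypothetical, and a single-head dot-product cross-attention without learned projections stands in for a real Q-Former. It only shows the idea of producing one token sequence per utterance segment and one for the entire video, then concatenating them.

```python
import math


def attend(queries, frames):
    """Single-head dot-product cross-attention (illustrative stand-in for a Q-Former).

    Each query yields one token: a softmax-weighted average of the frame features.
    """
    dim = len(frames[0])
    tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in frames]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        tokens.append(
            [sum(w * f[d] for w, f in zip(weights, frames)) / total for d in range(dim)]
        )
    return tokens


def utterance_aware_tokens(frames, timestamps, utterances, queries):
    """Concatenate per-utterance token sequences with a whole-video token sequence."""
    sequence = []
    for start, end in utterances:
        # Assumption: a frame belongs to an utterance if start <= timestamp < end.
        segment = [f for f, t in zip(frames, timestamps) if start <= t < end]
        if segment:  # skip utterances that contain no sampled frame
            sequence.extend(attend(queries, segment))
    sequence.extend(attend(queries, frames))  # video-level tokens
    return sequence


# Toy inputs (hypothetical): four frames, two utterance segments, two queries.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
timestamps = [0.0, 1.0, 2.0, 3.0]
utterances = [(0.0, 2.0), (2.0, 4.0)]
queries = [[1.0, 0.0], [0.0, 1.0]]

tokens = utterance_aware_tokens(frames, timestamps, utterances, queries)
print(len(tokens), len(tokens[0]))
```

With two queries, each utterance contributes two tokens and the whole video contributes two more, so the combined sequence grows with the number of utterances. How the actual model orders, projects, or fuses these tokens is not specified in the abstract.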
Related papers
- Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis [6.387263468033964]
We introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations.
In addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM.
Our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
arXiv Detail & Related papers (2025-01-16T12:27:05Z)
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [105.88658935310605]
We propose a multi-stage training methodology that progressively trains LLM to understand both visual and speech information.
Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities.
By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities.
arXiv Detail & Related papers (2025-01-03T18:59:52Z)
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.
HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos.
MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks.
But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored.
EmoLLM is a novel model for multimodal emotional understanding that incorporates two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
- FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference [6.279057784373124]
In this paper, we design a fully multimodal video-to-emotion system (FV2ES) for fast yet effective recognition inference.
The adoption of the hierarchical attention method upon the sound spectra breaks through the limited contribution of the acoustic modality.
The further integration of data pre-processing into the aligned multimodal learning model allows the significant reduction of computational costs and storage space.
arXiv Detail & Related papers (2022-09-21T08:05:26Z)
- Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization.
The emotions have extremely varied temporal dynamics.
The fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- Recognizing Micro-Expression in Video Clip with Adaptive Key-Frame Mining [18.34213657996624]
In micro-expression, facial movement is transient and sparsely localized through time.
We propose a novel end-to-end deep learning architecture, referred to as adaptive key-frame mining network (AKMNet).
AKMNet is able to learn discriminative-temporal representation by combining spatial features of self-learned local key frames and their global-temporal dynamics.
arXiv Detail & Related papers (2020-09-19T07:03:16Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.