BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2503.23990v1
- Date: Mon, 31 Mar 2025 12:04:53 GMT
- Title: BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation
- Authors: Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Yulin Wu, Bingquan Liu,
- Abstract summary: We propose a behavior-aware MLLM-based framework (BeMERC) to incorporate the speaker's behaviors into a vanilla MLLM-based MERC model. BeMERC outperforms state-of-the-art methods on two benchmark datasets.
- Score: 29.514459004019024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Unlike text and audio inputs, learning from videos, which are rich in facial expressions, body language, and posture, provides emotion-trigger signals that enable the model to make more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) that incorporates the speaker's behaviors, including subtle facial micro-expressions, body language, and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction-tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and the paper also provides a detailed discussion of the significance of video-derived behavior information in MERC.
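No code accompanies this listing, so the snippet below is only a minimal sketch of how video-derived behavior cues could be folded into instruction-tuning data for an MLLM-based MERC predictor, in the spirit of the abstract above. The prompt template, field names (speaker, text, behavior), and emotion label set are hypothetical, not the authors' released format.

```python
# Hypothetical sketch: packaging video-derived behavior descriptions into
# instruction-tuning samples for an MLLM-based MERC predictor.
# Field names and the prompt template are assumptions, not the authors' format.
from dataclasses import dataclass
from typing import List

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised", "fearful", "disgusted"]

@dataclass
class Utterance:
    speaker: str
    text: str
    behavior: str   # e.g. a caption of facial micro-expressions / posture from a video captioner
    emotion: str    # gold label, used only as the training target

def build_instruction_sample(dialogue: List[Utterance], target_idx: int) -> dict:
    """Render the conversation history with behavior cues and ask for the target utterance's emotion."""
    history = "\n".join(
        f"{u.speaker}: {u.text} [behavior: {u.behavior}]" for u in dialogue[: target_idx + 1]
    )
    target = dialogue[target_idx]
    instruction = (
        "You are an emotion recognition assistant.\n"
        f"Conversation so far:\n{history}\n"
        f"Question: which of {EMOTIONS} best describes {target.speaker}'s "
        "emotion in the last utterance?"
    )
    return {"instruction": instruction, "response": target.emotion}
```

Under this reading, a first instruction-tuning stage could adapt the model to single utterances paired with their behavior captions, and a second stage could reuse the same template over full conversations for end-to-end training of the MERC predictor; both stages are assumptions about how the described two-stage strategy might be realized.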
Related papers
- Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation [58.189703277322224]
Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion.
Emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models.
We propose to learn content and emotion priors as guidance, augmented with contrastive learning, to obtain decoupled content and emotion representations.
arXiv Detail & Related papers (2025-04-08T04:34:38Z)
- Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis [6.387263468033964]
We introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations.
In addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM.
Our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
arXiv Detail & Related papers (2025-01-16T12:27:05Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos.
MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z)
- MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues [0.0]
We propose a time-sensitive Multimodal Large Language Model (MLLM) aimed at directing attention to the local facial micro-expression dynamics.
Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level, timestamp-bound image features with local facial features capturing the temporal dynamics of micro-expressions; and (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, and then combining them.
arXiv Detail & Related papers (2024-07-23T15:05:55Z)
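The MicroEmo entry above names its modules without showing them; purely as an illustration of what an "utterance-aware video Q-Former" could look like, the following minimal PyTorch sketch lets learnable query tokens cross-attend to the frame features of each utterance segment and of the entire video, then concatenates the resulting token sequences. Layer sizes and the single-block design are assumptions, not MicroEmo's actual architecture.

```python
# Illustrative sketch of an "utterance-aware" Q-Former-style aggregator,
# not MicroEmo's released code.
import torch
import torch.nn as nn
from typing import List

class UtteranceAwareQFormer(nn.Module):
    """Learnable queries cross-attend to frame features per utterance segment and for the whole video."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def aggregate(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, dim) features for one span (an utterance segment or the full video)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, frame_feats, frame_feats)
        return tokens + self.ffn(tokens)

    def forward(self, segment_feats: List[torch.Tensor], video_feats: torch.Tensor) -> torch.Tensor:
        # One visual token sequence per utterance segment, plus one for the entire video, concatenated.
        per_segment = [self.aggregate(f) for f in segment_feats]
        per_video = self.aggregate(video_feats)
        return torch.cat(per_segment + [per_video], dim=1)  # (batch, (segments + 1) * num_queries, dim)
```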
- EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks.
But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored.
EmoLLM is a novel model for multimodal emotional understanding that incorporates two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
- Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories.
This dataset enables models to learn from varied scenarios and generalize to real-world applications.
We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
arXiv Detail & Related papers (2024-06-17T03:01:22Z)
- BLSP-Emo: Towards Empathetic Large Speech-Language Models [34.62210186235263]
We present BLSP-Emo, a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech.
Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses.
arXiv Detail & Related papers (2024-06-06T09:02:31Z)
- LaERC-S: Improving LLM-based Emotion Recognition in Conversation with Speaker Characteristics [25.284238441231853]
Emotion recognition in conversation (ERC) is the task of discerning human emotions for each utterance within a conversation.
Recent research in ERC has sought to exploit pre-trained large language models (LLMs) with speaker modelling to comprehend emotional states.
We present LaERC-S, a novel framework that stimulates LLMs to explore speaker characteristics involving the mental state and behavior of interlocutors.
arXiv Detail & Related papers (2024-03-12T02:37:11Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
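The M2FNet entry above mentions an adaptive margin-based triplet loss for learning emotion-relevant audio and visual features. The sketch below is a generic illustration in which the margin widens for hard negatives; this particular margin schedule is an assumption for illustration, not the paper's exact formulation.

```python
# Generic sketch of a triplet loss with an adaptive (similarity-dependent) margin,
# in the spirit of the M2FNet entry above; the margin schedule is an assumption.
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                                 negative: torch.Tensor,
                                 base_margin: float = 0.2, scale: float = 0.3) -> torch.Tensor:
    # Cosine distances between L2-normalized embeddings.
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_ap = 1.0 - (a * p).sum(dim=-1)
    d_an = 1.0 - (a * n).sum(dim=-1)
    # Widen the margin when the anchor and negative are similar (hard negatives).
    margin = base_margin + scale * (1.0 - d_an).clamp(min=0.0)
    return F.relu(d_ap - d_an + margin).mean()

# Example: 16 anchor/positive/negative embeddings of dimension 256.
loss = adaptive_margin_triplet_loss(torch.randn(16, 256), torch.randn(16, 256), torch.randn(16, 256))
```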
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.