Visual and textual prompts for enhancing emotion recognition in video
- URL: http://arxiv.org/abs/2504.17224v1
- Date: Thu, 24 Apr 2025 03:26:30 GMT
- Title: Visual and textual prompts for enhancing emotion recognition in video
- Authors: Zhifeng Wang, Qixuan Zhang, Peter Zhang, Wenjia Niu, Kaihao Zhang, Ramesh Sankaranarayana, Sabrina Caldwell, Tom Gedeon
- Abstract summary: Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions. We propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations, physiological signals, and contextual cues into a unified prompting strategy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others' emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs' video emotion recognition capabilities.
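The abstract describes composing spatial annotations, facial action units, and contextual cues into one prompt for a VLLM. A minimal sketch of what such a unified text prompt might look like is shown below; the dataclass fields, function name, and prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a SoVTP-style unified prompt builder.
# All names and the prompt template are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class FaceAnnotation:
    face_id: int
    bbox: tuple                                       # (x, y, w, h) in pixels
    action_units: list = field(default_factory=list)  # e.g. ["AU4", "AU15"]
    posture: str = ""                                 # body-language description

def build_sovtp_prompt(faces, scene_description, target_id):
    """Compose spatial, physiological, and contextual cues into one text prompt."""
    lines = [f"Scene: {scene_description}"]
    for f in faces:
        role = "TARGET" if f.face_id == target_id else "context"
        lines.append(
            f"Face {f.face_id} ({role}) at bbox={f.bbox}, "
            f"active AUs={f.action_units or ['none']}, posture={f.posture or 'unknown'}"
        )
    lines.append(
        f"Question: Based on all cues above, what emotion does face {target_id} show?"
    )
    return "\n".join(lines)
```

In practice the spatial annotations (boxes, landmarks) would also be drawn onto the video frames themselves; the sketch covers only the textual half of the visual-text prompt.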
Related papers
- Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework that disentangles identity from emotion and models cooperation among emotions with similar characteristics.
We develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.
Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.
Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z) - EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z) - Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge. We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z) - Visual Prompting in LLMs for Enhancing Emotion Recognition [10.608029430740364]
Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing.
We propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely.
arXiv Detail & Related papers (2024-10-03T06:33:43Z) - In-Depth Analysis of Emotion Recognition through Knowledge-Based Large Language Models [3.8153944233011385]
This paper contributes to the emerging field of context-based emotion recognition.
We propose an approach that combines emotion recognition methods with Bayesian Cue Integration.
We test this approach in the context of interpreting facial expressions during a social task, the prisoner's dilemma.
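Bayesian Cue Integration, as referenced above, fuses evidence from multiple cues (e.g. facial expression and task context) under Bayes' rule. A minimal sketch, assuming conditionally independent cues and hypothetical variable names:

```python
# Illustrative Bayesian cue integration for emotion recognition.
# Assumes cues are conditionally independent given the emotion;
# all names and numbers are hypothetical.

def integrate_cues(prior, cue_likelihoods):
    """Fuse per-cue likelihoods P(cue | emotion) with a prior P(emotion)
    via Bayes' rule, then normalize to obtain the posterior."""
    posterior = dict(prior)
    for likelihood in cue_likelihoods:
        for emotion in posterior:
            posterior[emotion] *= likelihood.get(emotion, 1e-9)
    total = sum(posterior.values())
    return {e: p / total for e, p in posterior.items()}
```

For example, a uniform prior combined with a facial cue favoring anger and a contextual cue mildly favoring anger yields a posterior concentrated on anger.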
arXiv Detail & Related papers (2024-07-17T06:39:51Z) - Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. In this paper, we examine two major approaches enabled by recent large vision language models. We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z) - Imagination-Augmented Natural Language Understanding [71.51687221130925]
We introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
arXiv Detail & Related papers (2022-04-18T19:39:36Z) - SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z) - Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction.
To the best of our knowledge, this is the first work to introduce a stimuli selection process into VEA in an end-to-end network.
Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z) - Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z) - Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z) - Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition [2.1485350418225244]
Spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis.
We propose a new deep learning-based approach for audio-visual emotion recognition.
arXiv Detail & Related papers (2021-03-16T15:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.