Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition
- URL: http://arxiv.org/abs/2511.10958v1
- Date: Fri, 14 Nov 2025 04:49:58 GMT
- Title: Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition
- Authors: Gunho Jung, Heejo Kong, Seong-Whan Lee
- Abstract summary: Dynamic facial expression recognition aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. We propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling.
- Score: 49.41688891301643
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive experiments demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
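To make the pipeline described in the abstract more concrete, the following is a minimal PyTorch sketch of a text-guided MIL head: frame-level instance features are conditioned with a learnable visual prompt, passed through short-term (convolutional) and long-range (self-attention) temporal modeling, scored against textual emotion-description embeddings, and pooled into a video-level prediction via frame-level relevance weights. This is an illustrative sketch under assumed shapes (512-d features, 7 emotion classes, features already extracted by a frozen VLP model), not the authors' implementation; all module and variable names are hypothetical.

```python
# Minimal sketch of text-guided MIL aggregation for DFER (not the authors' code).
# Assumptions: frame features and emotion-description text embeddings are already
# extracted (e.g., by a frozen VLP model); dimensions and modules are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMILHead(nn.Module):
    def __init__(self, dim=512, num_classes=7):
        super().__init__()
        # Learnable visual prompt that conditions frame features before alignment.
        self.visual_prompt = nn.Parameter(torch.zeros(1, 1, dim))
        # Short-term facial dynamics: local temporal convolution over neighboring frames.
        self.short_term = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Long-range emotional flow: one self-attention layer over the whole sequence.
        self.long_term = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.scale = dim ** -0.5

    def forward(self, frame_feats, text_embeds):
        # frame_feats: (B, T, D) frame-level visual features (the MIL instances)
        # text_embeds: (C, D) embeddings of enriched textual emotion descriptions
        x = frame_feats + self.visual_prompt                         # prompt-conditioned instances
        x = x + self.short_term(x.transpose(1, 2)).transpose(1, 2)   # short-term dynamics
        x = self.long_term(x)                                        # long-range emotional flow
        x = F.normalize(x, dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        sim = x @ t.t() * self.scale                                 # (B, T, C) frame-text alignment
        relevance = sim.max(dim=-1).values.softmax(dim=1)            # frame-level relevance weights
        video_logits = (relevance.unsqueeze(-1) * sim).sum(dim=1)    # relevance-weighted MIL pooling
        return video_logits, relevance


# Toy usage with random tensors standing in for VLP-extracted features.
head = TextGuidedMILHead()
logits, rel = head(torch.randn(2, 16, 512), torch.randn(7, 512))
print(logits.shape, rel.shape)  # torch.Size([2, 7]) torch.Size([2, 16])
```

In practice, the frame features and emotion-description embeddings would come from the VLP model's image and text encoders, and training would rely only on the single video-level emotion label as weak supervision, with the relevance weights providing frame-level interpretability.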
Related papers
- VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation [34.905479321921575]
We propose VowelPrompt, a framework that augments large language models with interpretable, fine-grained vowel-level prosodic cues. We show that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions.
arXiv Detail & Related papers (2026-02-06T00:09:14Z)
- Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier [53.55996102181836]
We propose the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion. We show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions.
arXiv Detail & Related papers (2025-10-27T16:40:17Z)
- From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition [7.362433184546492]
Dynamic Facial Expression Recognition aims to identify human emotions from temporally evolving facial movements. Our method integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient features.
arXiv Detail & Related papers (2025-07-16T04:15:06Z)
- Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding [26.36195886824082]
Emotion-Qwen is a unified multimodal framework designed to simultaneously enable robust emotion understanding and preserve general reasoning capabilities. We develop the Video Emotion Reasoning dataset, a large-scale bilingual resource containing over 40K video clips annotated with detailed context-aware emotional descriptions.
arXiv Detail & Related papers (2025-05-10T16:15:26Z)
- Visual and Textual Prompts in VLLMs for Enhancing Emotion Recognition [16.317534822730256]
Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions. We propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations, physiological signals, and contextual cues into a unified prompting strategy (a rough prompt-assembly sketch in this spirit appears after this list).
arXiv Detail & Related papers (2025-04-24T03:26:30Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge. We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z)
- Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild [1.8604727699812171]
Video sequences often contain frames with different expression intensities, especially for facial expressions in real-world scenarios.
We propose the global convolution-attention block (GCA) to rescale the channels of the feature maps.
In addition, we introduce the intensity-aware loss (IAL) in the training process to help the network distinguish the samples with relatively low expression intensities.
arXiv Detail & Related papers (2022-08-19T12:48:07Z)
- Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [128.70585652795637]
TEL presents three unique challenges compared to temporal action localization.
The emotions have extremely varied temporal dynamics.
The fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)
- Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z)
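As a side illustration of the prompting idea described in the Set-of-Vision-Text Prompting (SoVTP) entry above, the sketch below assembles spatial, physiological, and contextual cues into a single zero-shot emotion-recognition prompt for a VLLM. It is a hypothetical sketch: the field names, wording, and prompt layout are assumptions, not SoVTP's actual prompt format.

```python
# Hypothetical prompt assembly inspired by the SoVTP entry above; field names,
# wording, and layout are illustrative assumptions, not the paper's prompt format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SceneCues:
    faces: List[str] = field(default_factory=list)           # e.g. "person 1: brows raised"
    body_language: List[str] = field(default_factory=list)
    physiological: List[str] = field(default_factory=list)   # e.g. "person 1: rapid blinking"
    context: str = ""                                         # environment / social setting


def build_emotion_prompt(cues: SceneCues, labels: List[str]) -> str:
    """Combine the available cues into one zero-shot classification prompt."""
    sections = [
        "You are analyzing a video frame for emotion recognition.",
        "Facial observations:\n- " + "\n- ".join(cues.faces or ["none"]),
        "Body language:\n- " + "\n- ".join(cues.body_language or ["none"]),
        "Physiological signals:\n- " + "\n- ".join(cues.physiological or ["none"]),
        f"Scene context: {cues.context or 'unknown'}",
        f"Choose the single most likely emotion from: {', '.join(labels)}.",
    ]
    return "\n\n".join(sections)


cues = SceneCues(
    faces=["person 1: corners of the mouth pulled down"],
    body_language=["person 1: arms crossed"],
    context="two people arguing in an office",
)
print(build_emotion_prompt(cues, ["happiness", "sadness", "anger", "fear", "neutral"]))
```

The resulting string would then accompany the video frame(s) passed to the VLLM for zero-shot emotion classification.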