Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
- URL: http://arxiv.org/abs/2507.21015v1
- Date: Mon, 28 Jul 2025 17:28:08 GMT
- Title: Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
- Authors: Licai Sun, Xingxun Jiang, Haoyu Chen, Yante Li, Zheng Lian, Bin Liu, Yuan Zong, Wenming Zheng, Jukka M. Leppänen, Guoying Zhao
- Abstract summary: We introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples. We propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module.
- Score: 39.81062289449454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current facial emotion recognition systems are predominantly trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at https://github.com/sunlicai/EmoCapCLIP.
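Since the abstract only names the components of EmoCapCLIP, the following is a minimal PyTorch sketch of what a joint global-local contrastive objective with positive mining could look like. The function names, the soft-target formulation, the similarity threshold, and the temperature are illustrative assumptions rather than the authors' implementation; the actual code is promised at the repository linked above.

```python
# Hedged sketch: assumed formulation of a global-local contrastive objective
# with caption-similarity-guided soft positives. Not the authors' released code.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(face_emb, text_emb, soft_targets, temperature=0.07):
    """Symmetric InfoNCE-style loss with soft targets, so semantically close
    captions can count as extra positives instead of being pushed apart."""
    face_emb = F.normalize(face_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = face_emb @ text_emb.t() / temperature                 # (B, B) face-to-caption similarities
    loss_f2t = (-soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2f = (-soft_targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_f2t + loss_t2f)

def mine_soft_positives(text_emb, threshold=0.9):
    """Assumed form of positive mining: captions whose embeddings are nearly
    identical are treated as mutual positives, replacing the one-hot
    contrastive target with a row-normalized soft target."""
    text_emb = F.normalize(text_emb, dim=-1)
    sim = text_emb @ text_emb.t()                                   # (B, B) caption-caption similarities
    targets = (sim >= threshold).float()                            # diagonal is always 1.0 >= threshold
    return targets / targets.sum(dim=1, keepdim=True)

# Toy usage: random tensors stand in for face-encoder and caption-encoder outputs
# at the global (whole-description) and local (facial-behavior) levels.
B, D = 8, 512
face_global, face_local = torch.randn(B, D), torch.randn(B, D)
cap_global, cap_local = torch.randn(B, D), torch.randn(B, D)

total_loss = (soft_contrastive_loss(face_global, cap_global, mine_soft_positives(cap_global))
              + soft_contrastive_loss(face_local, cap_local, mine_soft_positives(cap_local)))
```

The point the abstract emphasizes, that closely related expressions should not be treated as hard negatives, is what the soft targets above try to capture under these assumptions.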
Related papers
- CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation [3.5418954219513625]
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories. We propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images.
arXiv Detail & Related papers (2025-08-05T15:04:34Z)
- Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation [7.362433184546492]
Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence. This study proposes the Think-Before-Draw framework to address two key challenges.
arXiv Detail & Related papers (2025-07-17T03:33:46Z)
- From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition [7.362433184546492]
Dynamic Facial Expression Recognition aims to identify human emotions from temporally evolving facial movements. Our method integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient features.
arXiv Detail & Related papers (2025-07-16T04:15:06Z)
- VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection. VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z)
- Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis [34.100793905255955]
Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms. Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content. We present Chimera, a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects.
arXiv Detail & Related papers (2025-04-22T12:43:37Z)
- Beyond Vision: How Large Language Models Interpret Facial Expressions from Valence-Arousal Values [6.987852837732702]
Large Language Models primarily operate through text-based inputs and outputs, yet human emotion is communicated through both verbal and non-verbal cues, including facial expressions. This study investigates whether LLMs can infer affective meaning from the dimensions of facial expression, Valence and Arousal values provided as structured numerical representations, rather than from raw visual input.
arXiv Detail & Related papers (2025-02-08T09:54:03Z)
- When Words Smile: Generating Diverse Emotional Facial Expressions from Text [72.19705878257204]
We introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent.
arXiv Detail & Related papers (2024-12-03T15:39:05Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows semantic information into the visual prompt to distill a semantic-enhanced prompt for visual representation enrichment. AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification. First, we prompt VLLMs to generate natural language descriptions of the subject's apparent emotion in relation to the visual context. Second, the descriptions, along with the visual input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z)
- Learning to Augment Expressions for Few-shot Fine-grained Facial Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database, F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Considering that uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework, the Compositional Generative Adversarial Network (Comp-GAN), which learns to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)