Contextual Emotion Recognition using Large Vision Language Models
- URL: http://arxiv.org/abs/2405.08992v1
- Date: Tue, 14 May 2024 23:24:12 GMT
- Title: Contextual Emotion Recognition using Large Vision Language Models
- Authors: Yasaman Etesam, Özge Nilay Yalçın, Chuxuan Zhang, Angelica Lim,
- Abstract summary: Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
- Score: 0.6749750044497732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: "How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
Related papers
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z) - How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations [5.895694050664867]
We introduce a novel approach for facial expression classification that goes beyond simple classification tasks.
Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context.
We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations.
arXiv Detail & Related papers (2024-09-04T09:32:40Z) - Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning [0.6749750044497732]
We propose methods to incorporate the emotional reasoning capabilities by constructing "narrative captions" relevant to emotion perception.
We propose two distinct ways to construct these captions using zero-shot classifiers (CLIP) and fine-tuning visual-language models (LLaVA) over human generated descriptors.
Our experiments showed that combining the "Fast" narrative descriptors and "Slow" reasoning of language models is a promising way to achieve emotional theory of mind.
arXiv Detail & Related papers (2023-10-30T20:26:12Z) - Contextual Emotion Estimation from Image Captions [0.6749750044497732]
We explore whether Large Language Models can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference.
We generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset.
We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations.
arXiv Detail & Related papers (2023-09-22T18:44:34Z) - Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperform the prior-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z) - Affection: Learning Affective Explanations for Real-World Visual Data [50.28825017427716]
We introduce and share with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85,007 publicly available images.
We show that there is significant common ground to capture potentially plausible emotional responses with a large support in the subject population.
Our work paves the way for richer, more human-centric, and emotionally-aware image analysis systems.
arXiv Detail & Related papers (2022-10-04T22:44:17Z) - Imagination-Augmented Natural Language Understanding [71.51687221130925]
We introduce an Imagination-Augmented Cross-modal (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
arXiv Detail & Related papers (2022-04-18T19:39:36Z) - Enhancing Cognitive Models of Emotions with Representation Learning [58.2386408470585]
We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions.
Our framework integrates a contextualized embedding encoder with a multi-head probing model.
Our model is evaluated on the Empathetic Dialogue dataset and shows the state-of-the-art result for classifying 32 emotions.
arXiv Detail & Related papers (2021-04-20T16:55:15Z) - A Multi-resolution Approach to Expression Recognition in the Wild [9.118706387430883]
We propose a multi-resolution approach to solve the Facial Expression Recognition task.
We ground our intuition on the observation that often faces images are acquired at different resolutions.
To our aim, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset.
arXiv Detail & Related papers (2021-03-09T21:21:02Z) - Vision and Language: from Visual Perception to Content Creation [100.36776435627962]
"vision to language" is probably one of the most popular topics in the past five years.
This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision"
arXiv Detail & Related papers (2019-12-26T14:07:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.