Contextual Emotion Estimation from Image Captions
- URL: http://arxiv.org/abs/2309.13136v1
- Date: Fri, 22 Sep 2023 18:44:34 GMT
- Title: Contextual Emotion Estimation from Image Captions
- Authors: Vera Yang, Archita Srivastava, Yasaman Etesam, Chuxuan Zhang, Angelica Lim
- Abstract summary: We explore whether Large Language Models can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference.
We generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset.
We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations.
- Score: 0.6749750044497732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion estimation in images is a challenging task, typically
addressed with computer vision methods that directly estimate people's emotions
from face, body pose, and contextual cues. In this paper, we explore whether
Large Language Models (LLMs)
can support the contextual emotion estimation task, by first captioning images,
then using an LLM for inference. First, we must understand: how well do LLMs
perceive human emotions? And which parts of the information enable them to
determine emotions? One initial challenge is to construct a caption that
describes a person within a scene with information relevant for emotion
perception. Towards this goal, we propose a set of natural language descriptors
for faces, bodies, interactions, and environments. We use them to manually
generate captions and emotion annotations for a subset of 331 images from the
EMOTIC dataset. These captions offer an interpretable representation for
emotion estimation, towards understanding how elements of a scene affect
emotion perception in LLMs and beyond. Second, we test the capability of a
large language model to infer an emotion from the resulting image captions. We
find that GPT-3.5, specifically the text-davinci-003 model, provides
surprisingly reasonable emotion predictions consistent with human annotations,
but accuracy can depend on the emotion concept. Overall, the results suggest
promise in the image captioning and LLM approach.
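To make the two-stage pipeline concrete, here is a minimal, hypothetical Python sketch of the approach the abstract describes: assembling a caption from the four descriptor types (faces, bodies, interactions, environments) and wrapping it in an emotion query for an LLM. The descriptor fields, helper names, and prompt wording are illustrative assumptions, not the paper's actual templates.
```python
# Hypothetical sketch of the caption-then-infer pipeline described above.
# Descriptor fields and prompt wording are illustrative assumptions, not
# the paper's exact templates.
from dataclasses import dataclass

@dataclass
class SceneDescriptors:
    face: str          # e.g. "a furrowed brow"
    body: str          # e.g. "slumped shoulders"
    interaction: str   # e.g. "standing apart from a group"
    environment: str   # e.g. "a crowded train station at night"

def build_caption(d: SceneDescriptors) -> str:
    """Assemble a natural-language caption from the four descriptor types."""
    return (
        f"A person with {d.face}, {d.body}, {d.interaction}, "
        f"in {d.environment}."
    )

def build_prompt(caption: str) -> str:
    """Wrap the caption in an emotion-estimation query for the LLM."""
    return (
        f"{caption}\n"
        "Which emotions is this person most likely feeling, and why?"
    )

if __name__ == "__main__":
    d = SceneDescriptors(
        face="a furrowed brow",
        body="slumped shoulders",
        interaction="standing apart from a group",
        environment="a crowded train station at night",
    )
    print(build_prompt(build_caption(d)))
    # The prompt would then be sent to an LLM such as text-davinci-003
    # and the returned text compared against the EMOTIC emotion categories.
```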
Related papers
- Think out Loud: Emotion Deducing Explanation in Dialogues [57.90554323226896]
We propose a new task, "Emotion Deducing Explanation in Dialogues" (EDEN).
EDEN recognizes emotions and their causes in an explicit, think-aloud manner.
It can help Large Language Models (LLMs) achieve better recognition of emotions and causes.
arXiv Detail & Related papers (2024-06-07T08:58:29Z)
- Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning [0.6749750044497732]
We propose methods to incorporate emotional reasoning capabilities by constructing "narrative captions" relevant to emotion perception.
We propose two distinct ways to construct these captions: zero-shot classifiers (CLIP) and fine-tuning vision-language models (LLaVA) over human-generated descriptors.
Our experiments showed that combining the "Fast" narrative descriptors and "Slow" reasoning of language models is a promising way to achieve emotional theory of mind; a minimal zero-shot sketch follows below.
arXiv Detail & Related papers (2023-10-30T20:26:12Z)
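As a loose illustration of the zero-shot (CLIP) route mentioned above, this sketch scores candidate descriptor phrases against an image with the Hugging Face transformers CLIP interface; the phrase list, checkpoint, and file name are assumptions for illustration, not the paper's actual descriptor set.
```python
# Zero-shot CLIP sketch: score descriptor phrases against an image, in the
# spirit of the "Fast" narrative captions above. Phrases and checkpoint are
# illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptors = [
    "a person who is smiling",
    "a person who is frowning",
    "a person in a crowded place",
    "a person alone outdoors",
]

image = Image.open("scene.jpg")  # hypothetical input image
inputs = processor(text=descriptors, images=image,
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)

# The highest-scoring phrases would be stitched into a narrative caption
# that a language model then reasons over (the "Slow" stage).
for phrase, p in zip(descriptors, probs[0].tolist()):
    print(f"{p:.2f}  {phrase}")
```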
- High-Level Context Representation for Emotion Recognition in Images [4.987022981158291]
We propose an approach for high-level context representation extraction from images.
The model relies on a single cue and a single encoding stream to correlate this representation with emotions.
Our approach is more efficient than previous models and can be easily deployed to address real-world problems related to emotion recognition.
arXiv Detail & Related papers (2023-05-05T13:20:41Z)
- Contextually-rich human affect perception using multimodal scene information [36.042369831043686]
We leverage pretrained vision-language models (VLMs) to extract descriptions of foreground context from images.
We propose a multimodal context fusion (MCF) module to combine foreground cues with the visual scene and person-based contextual information for emotion prediction.
We show the effectiveness of our proposed modular design on two datasets associated with natural scenes and TV shows.
arXiv Detail & Related papers (2023-03-13T07:46:41Z)
- PERI: Part Aware Emotion Recognition In The Wild [4.206175795966693]
This paper focuses on emotion recognition using visual features.
We create part-aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks.
We provide our results on the publicly available in-the-wild EMOTIC dataset; a minimal masking sketch follows below.
arXiv Detail & Related papers (2022-10-18T20:01:40Z)
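For intuition only, here is a minimal NumPy sketch of the part-aware masking idea in the PERI summary above: keep square patches around landmark locations and zero out the rest. The landmark coordinates, patch size, and image are made-up placeholders; the actual masks come from body-pose and facial-landmark detectors.
```python
# Loose NumPy illustration of PAS-style masking: retain only regions around
# given landmarks and black out the rest. All inputs are placeholders.
import numpy as np

def part_aware_mask(image: np.ndarray,
                    landmarks: list[tuple[int, int]],
                    half: int = 16) -> np.ndarray:
    """Zero out everything except square patches centred on each landmark."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    h, w = mask.shape
    for y, x in landmarks:
        mask[max(0, y - half):min(h, y + half),
             max(0, x - half):min(w, x + half)] = True
    return image * mask[..., None]  # broadcast mask over colour channels

if __name__ == "__main__":
    img = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
    pas = part_aware_mask(img, [(40, 40), (80, 90)])  # e.g. face, hand
    print(pas.shape, pas.dtype)
```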
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
- Enhancing Cognitive Models of Emotions with Representation Learning [58.2386408470585]
We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions.
Our framework integrates a contextualized embedding encoder with a multi-head probing model.
Our model is evaluated on the Empathetic Dialogue dataset and achieves state-of-the-art results for classifying 32 emotions.
arXiv Detail & Related papers (2021-04-20T16:55:15Z)
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)
- Annotation of Emotion Carriers in Personal Narratives [69.07034604580214]
We are interested in the problem of understanding personal narratives (PNs): spoken or written recollections of facts, events, and thoughts.
In PNs, emotion carriers are the speech or text segments that best explain the emotional state of the user.
This work proposes and evaluates an annotation model for identifying emotion carriers in spoken personal narratives.
arXiv Detail & Related papers (2020-02-27T15:42:39Z)