Contextually-rich human affect perception using multimodal scene
information
- URL: http://arxiv.org/abs/2303.06904v1
- Date: Mon, 13 Mar 2023 07:46:41 GMT
- Title: Contextually-rich human affect perception using multimodal scene
information
- Authors: Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan
- Abstract summary: We leverage pretrained vision-language (VLN) models to extract descriptions of foreground context from images.
We propose a multimodal context fusion (MCF) module to combine foreground cues with the visual scene and person-based contextual information for emotion prediction.
We show the effectiveness of our proposed modular design on two datasets associated with natural scenes and TV shows.
- Score: 36.042369831043686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The process of human affect understanding involves the ability to infer
person-specific emotional states from various sources including images, speech,
and language. Affect perception from images has predominantly focused on
expressions extracted from salient face crops. However, emotions perceived by
humans rely on multiple contextual cues including social settings, foreground
interactions, and ambient visual scenes. In this work, we leverage pretrained
vision-language (VLN) models to extract descriptions of foreground context from
images. Further, we propose a multimodal context fusion (MCF) module to combine
foreground cues with the visual scene and person-based contextual information
for emotion prediction. We show the effectiveness of our proposed modular
design on two datasets associated with natural scenes and TV shows.
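
Below is a minimal, hedged sketch of what a fusion step in this spirit could look like. The module name, feature dimensions, the choice of concatenation followed by an MLP, and the 26-way emotion output are illustrative assumptions, not the authors' actual MCF design; each context stream is assumed to be pre-encoded into a fixed-size embedding.

```python
import torch
import torch.nn as nn

class MultimodalContextFusion(nn.Module):
    """Illustrative fusion of three context streams for emotion prediction.

    Assumptions (not from the paper): each stream is already encoded, e.g. a
    sentence embedding of the VLN-generated foreground description, a scene
    feature from a scene-classification CNN, and a person/body-crop feature.
    """

    def __init__(self, text_dim=768, scene_dim=512, person_dim=512,
                 hidden_dim=256, num_emotions=26):
        super().__init__()
        # Project each modality into a shared space before fusing.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.scene_proj = nn.Linear(scene_dim, hidden_dim)
        self.person_proj = nn.Linear(person_dim, hidden_dim)
        # Simple concat + MLP fusion; the actual MCF module may differ.
        self.fusion = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, text_feat, scene_feat, person_feat):
        h = torch.cat([
            self.text_proj(text_feat),
            self.scene_proj(scene_feat),
            self.person_proj(person_feat),
        ], dim=-1)
        return self.fusion(h)  # emotion logits


# Toy usage with random features standing in for real encoders.
mcf = MultimodalContextFusion()
logits = mcf(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 26])
```
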
Related papers
- How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations [5.895694050664867]
We introduce a novel approach for facial expression classification that goes beyond simple classification tasks.
Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context.
We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations.
arXiv Detail & Related papers (2024-09-04T09:32:40Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Contextual Emotion Estimation from Image Captions [0.6749750044497732]
We explore whether Large Language Models can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference.
We generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset.
We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations.
arXiv Detail & Related papers (2023-09-22T18:44:34Z)
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)
- ArtEmis: Affective Language for Visual Art [46.643106054408285]
We focus on the affective experience triggered by visual artworks.
We ask the annotators to indicate the dominant emotion they feel for a given image.
This leads to a rich set of signals for both the objective content and the affective impact of an image.
arXiv Detail & Related papers (2021-01-19T01:03:40Z)
- Context Based Emotion Recognition using EMOTIC Dataset [22.631542327834595]
We present EMOTIC, a dataset of images of people annotated with their apparent emotion.
Using the EMOTIC dataset we train different CNN models for emotion recognition.
Our results show how scene context provides important information to automatically recognize emotional states.
arXiv Detail & Related papers (2020-03-30T12:38:50Z)
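
For the EMOTIC entry above, a minimal sketch of a two-branch, body-plus-scene CNN in that spirit is shown below. The ResNet-18 backbones, feature sizes, and head layout are assumptions for illustration, not the exact EMOTIC architecture; the 26 discrete categories and 3 continuous dimensions follow the EMOTIC annotation scheme.

```python
import torch
import torch.nn as nn
from torchvision import models

class ContextEmotionNet(nn.Module):
    """Two-branch CNN for context-based emotion recognition: one branch sees
    the person (body crop), the other sees the full scene. Backbone choice
    and head layout are illustrative assumptions."""

    def __init__(self, num_discrete=26, num_continuous=3):
        super().__init__()
        self.body_backbone = models.resnet18(weights=None)
        self.scene_backbone = models.resnet18(weights=None)
        feat_dim = self.body_backbone.fc.in_features  # 512 for resnet18
        # Strip the classification heads; keep the pooled features.
        self.body_backbone.fc = nn.Identity()
        self.scene_backbone.fc = nn.Identity()
        # Joint heads predict discrete emotion categories and
        # continuous valence/arousal/dominance scores.
        self.discrete_head = nn.Linear(2 * feat_dim, num_discrete)
        self.continuous_head = nn.Linear(2 * feat_dim, num_continuous)

    def forward(self, body_crop, full_image):
        feats = torch.cat([self.body_backbone(body_crop),
                           self.scene_backbone(full_image)], dim=-1)
        return self.discrete_head(feats), self.continuous_head(feats)


# Toy forward pass with random image tensors (batch of 2, 3x224x224).
model = ContextEmotionNet()
cat_logits, vad = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(cat_logits.shape, vad.shape)  # torch.Size([2, 26]) torch.Size([2, 3])
```
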