How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
- URL: http://arxiv.org/abs/2210.10039v1
- Date: Tue, 18 Oct 2022 17:58:25 GMT
- Title: How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
- Authors: Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan,
Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks
- Abstract summary: We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
- Score: 73.24092762346095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, deep neural networks have demonstrated increasingly strong
abilities to recognize objects and activities in videos. However, as video
understanding becomes widely used in real-world applications, a key
consideration is developing human-centric systems that understand not only the
content of the video but also how it would affect the wellbeing and emotional
state of viewers. To facilitate research in this setting, we introduce two
large-scale datasets with over 60,000 videos manually annotated for emotional
response and subjective wellbeing. The Video Cognitive Empathy (VCE) dataset
contains annotations for distributions of fine-grained emotional responses,
allowing models to gain a detailed understanding of affective states. The Video
to Valence (V2V) dataset contains annotations of relative pleasantness between
videos, which enables predicting a continuous spectrum of wellbeing. In
experiments, we show how video models that are primarily trained to recognize
actions and find contours of objects can be repurposed to understand human
preferences and the emotional content of videos. Although there is room for
improvement, predicting wellbeing and emotional response is on the horizon for
state-of-the-art models. We hope our datasets can help foster further advances
at the intersection of commonsense video understanding and human preference
learning.
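To make the two annotation formats concrete, the sketch below shows, under assumed feature dimensions and an illustrative 27-way emotion taxonomy, how a generic pooled video representation could be trained against them: a KL-divergence loss to the annotated emotion distribution for VCE, and a pairwise Bradley-Terry-style ranking loss on relative pleasantness for V2V. The module, function, and dimension names are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch (not the authors' code): two heads on a generic video
# encoder, one for VCE emotion-distribution prediction and one for V2V
# relative-pleasantness ranking. Names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 27  # assumed size of the fine-grained emotion taxonomy

class WellbeingHeads(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.emotion_head = nn.Linear(feat_dim, NUM_EMOTIONS)  # VCE head
        self.valence_head = nn.Linear(feat_dim, 1)              # V2V head

    def forward(self, video_feats: torch.Tensor):
        # video_feats: (batch, feat_dim) pooled features from any video backbone
        return self.emotion_head(video_feats), self.valence_head(video_feats).squeeze(-1)

def vce_loss(emotion_logits, target_dist):
    # KL divergence between the predicted and annotated emotion distributions.
    log_pred = F.log_softmax(emotion_logits, dim=-1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")

def v2v_loss(valence_a, valence_b):
    # Pairwise (Bradley-Terry style) loss: video A was judged more pleasant
    # than video B, so its predicted valence should be higher.
    return -F.logsigmoid(valence_a - valence_b).mean()

if __name__ == "__main__":
    heads = WellbeingHeads()
    feats_a, feats_b = torch.randn(4, 768), torch.randn(4, 768)
    target = torch.softmax(torch.randn(4, NUM_EMOTIONS), dim=-1)
    logits_a, val_a = heads(feats_a)
    _, val_b = heads(feats_b)
    loss = vce_loss(logits_a, target) + v2v_loss(val_a, val_b)
    loss.backward()
    print(float(loss))
```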
Related papers
- eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos [7.011656298079659]
The widespread use of short videos (SVs) makes emotion recognition in SVs a necessity.
Considering the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos.
We present AV-CPNet, an end-to-end baseline method that employs a video transformer to better learn semantically relevant representations.
arXiv Detail & Related papers (2023-11-29T03:24:30Z)
- Contextual Explainable Video Representation: Human Perception-based Understanding [10.172332586182792]
We discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment.
We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding.
arXiv Detail & Related papers (2022-12-12T19:29:07Z)
- Affection: Learning Affective Explanations for Real-World Visual Data [50.28825017427716]
We introduce and share with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85,007 publicly available images.
We show that there is significant common ground among viewers, making it possible to capture plausible emotional responses that have broad support in the subject population.
Our work paves the way for richer, more human-centric, and emotionally-aware image analysis systems.
arXiv Detail & Related papers (2022-10-04T22:44:17Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Use of Affective Visual Information for Summarization of Human-Centric Videos [13.273989782771556]
We investigate the affective-information enriched supervised video summarization task for human-centric videos.
First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes.
Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM).
arXiv Detail & Related papers (2021-07-08T11:46:04Z)
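For the affective summarization entry above, the following is a hypothetical late-fusion sketch: per-frame visual features are concatenated with estimated emotional attributes (for example, arousal and valence from a continuous emotion recognition model) and scored for inclusion in a summary. The layer sizes, attribute dimensionality, and selection rule are assumptions, not the published AVSUM architecture.

```python
# Hypothetical late-fusion frame scorer: concatenate per-frame visual features
# with estimated emotional attributes and predict a per-frame importance score.
# Shapes and layer sizes are assumptions, not the published AVSUM model.
import torch
import torch.nn as nn

class AffectiveFrameScorer(nn.Module):
    def __init__(self, visual_dim: int = 1024, affect_dim: int = 2, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + affect_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual_feats, affect_attrs):
        # visual_feats: (frames, visual_dim); affect_attrs: (frames, affect_dim)
        fused = torch.cat([visual_feats, affect_attrs], dim=-1)
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)  # per-frame importance in [0, 1]

# Example: keep the top 15% highest-scoring frames as the summary.
scorer = AffectiveFrameScorer()
scores = scorer(torch.randn(300, 1024), torch.randn(300, 2))
summary_idx = torch.topk(scores, k=int(0.15 * scores.numel())).indices
```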
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System [8.900866276512364]
Current approaches overlook the video's emotional characteristics in the music generation step.
We propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion.
Our model can effectively generate audio that matches the scene and elicits a similar emotion from the viewer on both datasets.
arXiv Detail & Related papers (2020-04-05T07:18:28Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
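As a rough illustration of the temporal attention described in the VAANet entry above, the sketch below pools per-segment CNN features with learned attention weights before emotion classification. It is a simplified, assumption-laden stand-in rather than the released VAANet model; the feature dimension and number of emotion classes are illustrative.

```python
# Minimal temporal-attention pooling sketch: per-segment CNN features are
# weighted by learned attention scores and summed into a clip-level feature.
# A simplification for illustration, not the released VAANet architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPool(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, segment_feats):
        # segment_feats: (batch, segments, feat_dim) from a visual 3D CNN or audio 2D CNN
        weights = F.softmax(self.score(segment_feats), dim=1)  # (batch, segments, 1)
        return (weights * segment_feats).sum(dim=1)            # (batch, feat_dim)

# Usage: pool 8 segment features, then classify into emotion categories.
pool = TemporalAttentionPool()
clip_feat = pool(torch.randn(2, 8, 512))
logits = nn.Linear(512, 8)(clip_feat)  # e.g., 8 video emotion classes
```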