Predicting emotion from music videos: exploring the relative
contribution of visual and auditory information to affective responses
- URL: http://arxiv.org/abs/2202.10453v1
- Date: Sat, 19 Feb 2022 07:36:43 GMT
- Title: Predicting emotion from music videos: exploring the relative
contribution of visual and auditory information to affective responses
- Authors: Phoebe Chua (1), Dimos Makris (2), Dorien Herremans (2), Gemma Roig
(3), Kat Agres (4) ((1) Department of Information Systems and Analytics,
National University of Singapore, (2) Singapore University of Technology and
Design, (3) Goethe University Frankfurt, (4) Yong Siew Toh Conservatory of
Music, National University of Singapore)
- Abstract summary: We present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis.
The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although media content is increasingly produced, distributed, and consumed in
multiple combinations of modalities, how individual modalities contribute to
the perceived emotion of a media item remains poorly understood. In this paper
we present MusicVideos (MuVi), a novel dataset for affective multimedia content
analysis to study how the auditory and visual modalities contribute to the
perceived emotion of media. The data were collected by presenting music videos
to participants in three conditions: music, visual, and audiovisual.
Participants annotated the music videos for valence and arousal over time, as
well as the overall emotion conveyed. We present detailed descriptive
statistics for key measures in the dataset and the results of feature
importance analyses for each condition. Finally, we propose a novel transfer
learning architecture to train Predictive models Augmented with Isolated
modality Ratings (PAIR) and demonstrate the potential of isolated modality
ratings for enhancing multimodal emotion recognition. Our results suggest that
perceptions of arousal are influenced primarily by auditory information, while
perceptions of valence are more subjective and can be influenced by both visual
and auditory information. The dataset is made publicly available.
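The abstract describes PAIR only at a high level, so the snippet below is a minimal PyTorch sketch of the general idea as I read it: a late-fusion valence/arousal regressor whose audio and visual branches also carry auxiliary heads supervised by the isolated music-only and visual-only ratings. All names, dimensions, and the fusion strategy are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical PAIR-style sketch: names, dimensions, and the fusion strategy
# are assumptions for illustration, not the authors' architecture.
import torch
import torch.nn as nn

class PairStyleRegressor(nn.Module):
    """Predicts valence/arousal from audio + visual features, with auxiliary
    heads supervised by the isolated-modality (music-only, visual-only) ratings."""

    def __init__(self, audio_dim=128, visual_dim=512, hidden=64):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Auxiliary heads: predict the ratings collected in the isolated conditions.
        self.audio_head = nn.Linear(hidden, 2)    # music-only valence/arousal
        self.visual_head = nn.Linear(hidden, 2)   # visual-only valence/arousal
        # Fusion head: predicts the audiovisual-condition ratings.
        self.fusion_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, audio_feats, visual_feats):
        a = self.audio_enc(audio_feats)
        v = self.visual_enc(visual_feats)
        return {
            "audio_only": self.audio_head(a),
            "visual_only": self.visual_head(v),
            "audiovisual": self.fusion_head(torch.cat([a, v], dim=-1)),
        }

def pair_style_loss(outputs, targets, aux_weight=0.5):
    """Joint loss in which the isolated-modality ratings act as auxiliary supervision."""
    mse = nn.functional.mse_loss
    return (mse(outputs["audiovisual"], targets["audiovisual"])
            + aux_weight * mse(outputs["audio_only"], targets["audio_only"])
            + aux_weight * mse(outputs["visual_only"], targets["visual_only"]))

# Example forward pass with random features (batch of 4 clips).
model = PairStyleRegressor()
out = model(torch.randn(4, 128), torch.randn(4, 512))
```

The auxiliary terms are one plausible way the isolated-modality ratings could inform the audiovisual prediction; the actual transfer learning setup in the paper may differ.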
Related papers
- Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings [10.302353984541497]
This research develops a model capable of generating music that resonates with the emotions depicted in visual arts.
Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music dataset.
Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data.
arXiv Detail & Related papers (2024-09-12T08:19:25Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Exploring Emotion Expression Recognition in Older Adults Interacting with a Virtual Coach [22.00225071959289]
The EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging.
This paper outlines the development of the emotion expression recognition module of the virtual coach, encompassing data collection, annotation design, and a first methodological approach.
arXiv Detail & Related papers (2023-11-09T18:22:32Z)
- Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language [0.0]
Our paper focuses on using deep neural network models to accurately predict the range of human emotions experienced while watching movies.
In this setup, three distinct input modalities considerably influence the experienced emotions: visual cues from RGB video frames; auditory components encompassing sounds, speech, and music; and linguistic elements from the actors' dialogues.
arXiv Detail & Related papers (2023-06-17T17:40:27Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Affective Image Content Analysis: Two Decades Review and New Perspectives [132.889649256384]
We comprehensively review the development of affective image content analysis (AICA) over the past two decades.
We focus on state-of-the-art methods with respect to three main challenges: the affective gap, perception subjectivity, and label noise and absence.
We also discuss open challenges and promising future research directions, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
arXiv Detail & Related papers (2021-06-30T15:20:56Z)
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study in which we leave one feature out at a time (a generic sketch of this leave-one-out protocol appears after this list).
For the video summarization task, our results indicate that the visual features carry the most information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
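Both the per-condition feature importance analyses in MuVi and the leave-one-feature-out ablation in the highlight detection paper above rest on the same simple protocol: re-evaluate a model while withholding one feature group at a time and measure the performance drop. The sketch below is a minimal, generic illustration of that protocol; the synthetic data, Ridge model, and feature groups are placeholders, not the setup used in any of the papers listed here.

```python
# Generic leave-one-feature-group-out ablation; the synthetic data and feature
# groups are placeholders, not the setup of any paper listed above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
feature_groups = {
    "audio": rng.normal(size=(n, 20)),
    "visual": rng.normal(size=(n, 30)),
}
# Synthetic arousal-like target that depends mostly on the "audio" block.
y = feature_groups["audio"] @ rng.normal(size=20) + 0.1 * rng.normal(size=n)

def score(groups):
    """Cross-validated R^2 of a Ridge regressor using only the given feature groups."""
    X = np.hstack([feature_groups[g] for g in groups])
    return cross_val_score(Ridge(), X, y, cv=5, scoring="r2").mean()

full = score(list(feature_groups))
print(f"all features: R^2 = {full:.3f}")
for held_out in feature_groups:
    remaining = [g for g in feature_groups if g != held_out]
    drop = full - score(remaining)
    print(f"without {held_out:>6}: R^2 drop = {drop:.3f}")
```

A large drop when a group is withheld indicates that the group carries information the remaining features cannot substitute, which is the sense in which the papers above attribute arousal mainly to audio or summaries mainly to visual features.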