Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with
Visual Computing for Improved Music Video Analysis
- URL: http://arxiv.org/abs/2002.00251v1
- Date: Sat, 1 Feb 2020 17:57:14 GMT
- Title: Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with
Visual Computing for Improved Music Video Analysis
- Authors: Alexander Schindler
- Abstract summary: This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
- Score: 91.3755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This thesis combines audio-analysis with computer vision to approach Music
Information Retrieval (MIR) tasks from a multi-modal perspective. This thesis
focuses on the information provided by the visual layer of music videos and how
it can be harnessed to augment and improve tasks of the MIR research domain.
The main hypothesis of this work is based on the observation that certain
expressive categories such as genre or theme can be recognized on the basis of
the visual content alone, without the sound being heard. This leads to the
hypothesis that there exists a visual language used to express mood or genre.
It follows that this visual information is music-related and should therefore
be beneficial for corresponding MIR tasks such as music genre classification
and mood recognition.
A series of comprehensive experiments and evaluations is conducted, focused on
the extraction of visual information and its application to different MIR
tasks. A custom dataset is created that is suitable for developing and testing
visual features able to represent music-related information.
Evaluations range from low-level visual features to high-level concepts
retrieved by means of Deep Convolutional Neural Networks. Additionally, new
visual features are introduced that capture rhythmic visual patterns. In all of
these experiments the audio-based results serve as the benchmark for the visual
and audio-visual approaches. The experiments are conducted for three MIR tasks:
Artist Identification, Music Genre Classification, and Cross-Genre
Classification. The experiments show that an audio-visual approach harnessing
high-level semantic information gained from visual concept detection
outperforms audio-only genre-classification accuracy by 16.43%.
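The abstract reports the fusion result without implementation details. As a loose, illustrative sketch only (not the thesis's actual pipeline), the Python snippet below shows one common way to combine frame-level concept scores from a pretrained CNN with simple audio features through feature-level fusion for genre classification. The ResNet-50/ImageNet model choice, MFCC audio descriptor, frame-sampling rate, and all function names are assumptions made for this example.
```python
"""Illustrative audio-visual fusion sketch (assumptions, not the thesis's method)."""
import cv2
import librosa
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.linear_model import LogisticRegression

# Pretrained CNN used here as a stand-in "visual concept detector".
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def visual_concept_features(video_path: str, every_n: int = 30) -> np.ndarray:
    """Average the CNN's class scores over every n-th frame of the music video."""
    cap = cv2.VideoCapture(video_path)
    scores, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                logits = cnn(preprocess(rgb).unsqueeze(0))
            scores.append(torch.softmax(logits, dim=1).squeeze(0).numpy())
        idx += 1
    cap.release()
    return np.mean(scores, axis=0)

def audio_features(audio_path: str) -> np.ndarray:
    """Mean MFCCs as a minimal audio descriptor (placeholder for richer features)."""
    y, sr = librosa.load(audio_path, sr=22050)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

def fuse(video_path: str, audio_path: str) -> np.ndarray:
    """Feature-level fusion by concatenating visual and audio descriptors."""
    return np.concatenate([visual_concept_features(video_path),
                           audio_features(audio_path)])

# Hypothetical usage with lists of (video, audio) pairs and genre labels:
# X = np.stack([fuse(v, a) for v, a in clips])
# clf = LogisticRegression(max_iter=1000).fit(X, genres)
```
The ImageNet class scores merely stand in for a generic concept detector; the thesis's actual visual concepts, fusion strategy, and classifier may differ.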
Related papers
- Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings [10.302353984541497]
This research develops a model capable of generating music that resonates with the emotions depicted in visual arts.
Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music dataset.
Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data.
arXiv Detail & Related papers (2024-09-12T08:19:25Z) - MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z) - Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z) - Predicting emotion from music videos: exploring the relative
contribution of visual and auditory information to affective responses [0.0]
We present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis.
The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual.
arXiv Detail & Related papers (2022-02-19T07:36:43Z) - An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene
Classification [58.720142291102135]
This paper presents a task of audio-visual scene classification (SC).
In this task, input videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'.
arXiv Detail & Related papers (2021-12-16T19:48:32Z) - Multi-task Learning with Metadata for Music Mood Classification [0.0]
Mood recognition is an important problem in music informatics and has key applications in music discovery and recommendation.
We propose a multi-task learning approach in which a shared model is simultaneously trained for mood and metadata prediction tasks.
Applying our technique on the existing state-of-the-art convolutional neural networks for mood classification improves their performances consistently.
arXiv Detail & Related papers (2021-10-10T11:36:34Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.