Musical Audio Similarity with Self-supervised Convolutional Neural
Networks
- URL: http://arxiv.org/abs/2202.02112v1
- Date: Fri, 4 Feb 2022 12:51:16 GMT
- Title: Musical Audio Similarity with Self-supervised Convolutional Neural
Networks
- Authors: Carl Thomé, Sebastian Piwell, Oscar Utterbäck
- Abstract summary: We have built a music similarity search engine that lets video producers search by listenable music excerpts.
Our system suggests similar sounding track segments in a large music catalog by training a self-supervised convolutional neural network.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We have built a music similarity search engine that lets video producers
search by listenable music excerpts, as a complement to traditional full-text
search. Our system suggests similar sounding track segments in a large music
catalog by training a self-supervised convolutional neural network with triplet
loss terms and musical transformations. Semi-structured user interviews
demonstrate that we can successfully impress professional video producers with
the quality of the search experience, and perceived similarities to query
tracks averaged 7.8/10 in user testing. We believe this search tool will make
for a more natural search experience, making it easier to find music to
soundtrack videos with.
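The abstract outlines the training setup only at a high level: a convolutional encoder is trained self-supervised with triplet loss terms, where positives are produced by musical transformations of the anchor excerpt and negatives come from other tracks. The sketch below illustrates that idea under stated assumptions: a spectrogram-input CNN, PyTorch's standard triplet margin loss, and a simple gain-plus-noise perturbation as a hypothetical stand-in for the paper's musical transformations; the architecture, margin, and all names are illustrative, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): self-supervised triplet training of a
# convolutional audio encoder. Augmentations, layer sizes, and the margin are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Small 2-D CNN over spectrogram excerpts -> L2-normalized embedding."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, x):                        # x: (batch, 1, mels, frames)
        z = self.conv(x).flatten(1)
        return F.normalize(self.fc(z), dim=-1)   # unit-length embeddings

def transform(spec: torch.Tensor) -> torch.Tensor:
    """Stand-in 'musical transformation': random gain plus light noise."""
    gain = torch.empty(spec.size(0), 1, 1, 1).uniform_(0.8, 1.2)
    return spec * gain + 0.01 * torch.randn_like(spec)

encoder = AudioEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)       # margin is an assumed value
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Toy batch: anchors are spectrogram excerpts, positives are transformed
# versions of the same excerpt, negatives are excerpts from other tracks.
anchors = torch.randn(8, 1, 64, 128)
positives = transform(anchors)
negatives = torch.randn(8, 1, 64, 128)

loss = triplet(encoder(anchors), encoder(positives), encoder(negatives))
opt.zero_grad()
loss.backward()
opt.step()
```

At query time, a setup like this would embed catalog segments offline and answer a query excerpt by nearest-neighbor search over the embeddings (e.g., by cosine distance); the paper does not specify the indexing backend, so that part is omitted here.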
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z) - MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z) - Self-Supervised Contrastive Learning for Robust Audio-Sheet Music
Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - MusCaps: Generating Captions for Music Audio [14.335950077921435]
We present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention.
Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs.
Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding.
arXiv Detail & Related papers (2021-04-24T16:34:47Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Learning to rank music tracks using triplet loss [6.43271391521664]
We propose a method for direct recommendation based on the audio content without explicitly tagging the music tracks.
We train a Convolutional Neural Network to learn the similarity via triplet loss.
Results highlight the efficiency of our system, especially when associated with an Auto-pooling layer.
arXiv Detail & Related papers (2020-05-18T08:20:54Z) - AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent
Videos with Deep Learning [5.33024001730262]
We present AutoFoley, a fully-automated deep learning tool that can be used to synthesize a representative audio track for videos.
AutoFoley can be used in applications where there is no corresponding audio file associated with the video, or where critical scenarios need to be identified.
Our experiments show that the synthesized sounds are realistically portrayed with accurate temporal synchronization of the associated visual inputs.
arXiv Detail & Related papers (2020-02-21T09:08:28Z) - Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with
Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z) - Audiovisual SlowFast Networks for Video Recognition [140.08143162600354]
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts.
We report results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.
arXiv Detail & Related papers (2020-01-23T18:59:46Z)