MULTIMODAL ANALYSIS: Informed content estimation and audio source
separation
- URL: http://arxiv.org/abs/2104.13276v1
- Date: Tue, 27 Apr 2021 15:45:21 GMT
- Title: MULTIMODAL ANALYSIS: Informed content estimation and audio source
separation
- Authors: Gabriel Meseguer-Brocal
- Abstract summary: The singing voice directly connects the audio signal and the text information in a unique way.
Our study focuses on the audio and lyrics interaction for targeting source separation and informed content estimation.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This dissertation proposes the study of multimodal learning in the context of
musical signals. Throughout, we focus on the interaction between audio signals
and text information. Among the many text sources related to music that can be
used (e.g. reviews, metadata, or social network feedback), we concentrate on
lyrics. The singing voice directly connects the audio signal and the text
information in a unique way, combining melody and lyrics where a linguistic
dimension complements the abstraction of musical instruments. Our study focuses
on the audio and lyrics interaction for targeting source separation and
informed content estimation.
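To make "informed" separation concrete: the idea is to condition a separation network on a text-side embedding so that the same backbone can target the source the text describes. Below is a minimal sketch in PyTorch using feature-wise affine (FiLM-style) conditioning on a lyrics embedding; all module names, dimensions, and the masking scheme are illustrative assumptions, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class LyricsInformedSeparator(nn.Module):
    """Sketch: predict a soft vocal mask from a mixture spectrogram,
    conditioned on a lyrics embedding via FiLM-style modulation.
    Hypothetical architecture, for illustration only."""

    def __init__(self, n_bins=513, hidden=256, lyrics_dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU())
        # The lyrics embedding is mapped to per-channel scale and shift.
        self.film = nn.Linear(lyrics_dim, 2 * hidden)
        self.decode = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mix_mag, lyrics_emb):
        # mix_mag: (batch, frames, n_bins); lyrics_emb: (batch, lyrics_dim)
        h = self.encode(mix_mag)
        scale, shift = self.film(lyrics_emb).chunk(2, dim=-1)
        h = scale.unsqueeze(1) * h + shift.unsqueeze(1)  # condition every frame
        mask = self.decode(h)                            # soft mask in [0, 1]
        return mask * mix_mag                            # estimated vocal magnitude
```

The conditioning branch is the "informed" part: without it this is a generic masking separator, and the text embedding is what steers it toward the singing voice.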
Related papers
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Investigating Personalization Methods in Text to Music Generation [21.71190700761388]
Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods.
For evaluation, we construct a novel dataset with prompts and music clips.
Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody.
arXiv Detail & Related papers (2023-09-20T08:36:34Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
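As a concrete illustration of the dual-encoder idea, here is a minimal sketch of a symmetric contrastive (InfoNCE) objective over paired audio and sentence embeddings; the loss form and temperature are generic assumptions, not necessarily MusCALL's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (audio, sentence)
    embeddings; matched pairs sit on the diagonal of the logit matrix."""
    a = F.normalize(audio_emb, dim=-1)  # (batch, dim)
    t = F.normalize(text_emb, dim=-1)   # (batch, dim)
    logits = a @ t.T / temperature      # pairwise cosine similarities
    targets = torch.arange(len(a), device=a.device)
    # Align audio-to-text and text-to-audio directions simultaneously.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```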
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
- Learning in Audio-visual Context: A Review, Analysis, and New Perspective [88.40519011197144]
This survey aims to systematically organize and analyze studies of the audio-visual field.
We introduce several key findings that have inspired our computational studies.
We propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area.
arXiv Detail & Related papers (2022-08-20T02:15:44Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Musical Word Embedding: Bridging the Gap between Listening Contexts and Music [5.89179309980335]
We train distributed word representations on a combination of general text data and music-specific data.
We evaluate the resulting embeddings by how well they associate listening contexts with musical compositions.
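A minimal sketch of that recipe, assuming gensim's Word2Vec and toy placeholder corpora (the paper's actual data sources and hyperparameters differ):

```python
from gensim.models import Word2Vec

# Placeholder corpora: each is a list of tokenized sentences.
general_text = [["the", "morning", "was", "calm", "and", "quiet"]]
music_text = [["this", "track", "has", "a", "calm", "acoustic", "mood"]]

# Training one shared embedding space over both corpora lets
# listening-context words land near musical descriptors.
model = Word2Vec(sentences=general_text + music_text,
                 vector_size=100, window=5, min_count=1, epochs=10)
print(model.wv.most_similar("calm"))
```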
arXiv Detail & Related papers (2020-07-23T06:42:45Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
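To illustrate the triplet setup, here is a minimal sketch of a margin-based triplet loss for track-relatedness; the margin value and distance measure are assumptions for illustration.

```python
import torch.nn.functional as F

def track_relatedness_triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull an anchor track's embedding toward a semantically related
    track and push it away from an unrelated one (margin-based)."""
    d_pos = F.pairwise_distance(anchor, positive)  # (batch,)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```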
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.