ALCAP: Alignment-Augmented Music Captioner
- URL: http://arxiv.org/abs/2212.10901v3
- Date: Sun, 22 Oct 2023 02:12:16 GMT
- Title: ALCAP: Alignment-Augmented Music Captioner
- Authors: Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman,
Xuchen Song
- Abstract summary: We introduce a method to learn multimodal alignment between audio and lyrics through contrastive learning.
This not only recognizes and emphasizes the synergy between audio and lyrics but also paves the way for models to achieve deeper cross-modal coherence.
- Score: 34.85003676798762
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Music captioning has gained significant attention in the wake of the rising
prominence of streaming media platforms. Traditional approaches often
prioritize either the audio or lyrics aspect of the music, inadvertently
ignoring the intricate interplay between the two. However, a comprehensive
understanding of music necessitates the integration of both these elements. In
this study, we delve into this overlooked realm by introducing a method to
systematically learn multimodal alignment between audio and lyrics through
contrastive learning. This not only recognizes and emphasizes the synergy
between audio and lyrics but also paves the way for models to achieve deeper
cross-modal coherence, thereby producing high-quality captions. We provide both
theoretical and empirical results demonstrating the advantage of the proposed
method, which achieves new state-of-the-art on two music captioning datasets.
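The abstract does not spell out the training objective, but a symmetric contrastive (InfoNCE-style) loss over paired audio and lyrics embeddings is the standard way to learn this kind of multimodal alignment. The sketch below is illustrative only: the encoder outputs are faked with random tensors, and the temperature and embedding size are arbitrary choices rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/lyrics embeddings.

    audio_emb, lyrics_emb: (batch, dim) tensors from separate encoders
    (e.g., an audio encoder and a lyrics text encoder -- illustrative only).
    """
    # Project both modalities onto the unit sphere so similarity is cosine.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(lyrics_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the true (aligned) pairs.
    logits = a @ t.T / temperature
    targets = torch.arange(a.size(0), device=a.device)

    # Contrast in both directions: audio-to-lyrics and lyrics-to-audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Example with random features standing in for encoder outputs.
audio = torch.randn(8, 512)
lyrics = torch.randn(8, 512)
print(contrastive_alignment_loss(audio, lyrics))
```

How such an alignment term would be weighted against the captioning loss is not stated in the abstract; the sketch only covers the alignment side.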
Related papers
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
- Contrastive Learning with Positive-Negative Frame Mask for Music Representation [91.44187939465948]
This paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR.
We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.
arXiv Detail & Related papers (2022-03-17T07:11:42Z)
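A minimal sketch of the frame-masking idea described in the PEMR entry above, under the assumption that per-frame importance scores are available (here a random placeholder): a positive view keeps the more salient frames of a clip and a negative view keeps the less salient ones, so both views come from the same music and can be contrasted against the unmasked anchor. This is not the paper's code; the keep ratio and masking scheme are guesses for illustration.

```python
import torch

def frame_masked_views(spec, importance, keep_ratio=0.7):
    """Build a positive and a negative view of one clip by masking frames.

    spec:       (frames, mel_bins) log-mel spectrogram of a single clip.
    importance: (frames,) per-frame importance scores -- a stand-in for
                whatever saliency signal the method actually derives.
    The positive keeps the most important frames; the negative keeps the
    least important ones, so both views come from the same music.
    """
    n_keep = max(1, int(keep_ratio * spec.size(0)))
    order = importance.argsort(descending=True)

    pos_mask = torch.zeros(spec.size(0), dtype=torch.bool)
    pos_mask[order[:n_keep]] = True          # keep salient frames
    neg_mask = torch.zeros(spec.size(0), dtype=torch.bool)
    neg_mask[order[-n_keep:]] = True         # keep non-salient frames

    positive = spec * pos_mask.unsqueeze(-1).float()
    negative = spec * neg_mask.unsqueeze(-1).float()
    return positive, negative

# Anchor clip plus its masked views; an encoder would map each to an
# embedding, and a contrastive loss would pull (anchor, positive) together
# and push (anchor, negative) apart.
clip = torch.randn(128, 64)                  # 128 frames, 64 mel bins
importance = torch.rand(128)                 # placeholder saliency scores
pos, neg = frame_masked_views(clip, importance)
print(pos.shape, neg.shape)
```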
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Multimodal Analysis: Informed Content Estimation and Audio Source Separation [0.0]
The singing voice directly connects the audio signal and the text information in a unique way.
Our study focuses on the audio and lyrics interaction for targeting source separation and informed content estimation.
arXiv Detail & Related papers (2021-04-27T15:45:21Z)
- MusCaps: Generating Captions for Music Audio [14.335950077921435]
We present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention.
Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs.
Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding.
arXiv Detail & Related papers (2021-04-24T16:34:47Z)
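The MusCaps entry above describes an encoder-decoder captioner with temporal attention over convolutional audio features. The sketch below shows that general pattern with a tiny model; the GRU decoder, multi-head attention, and all layer sizes are assumptions for illustration rather than the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class TinyAudioCaptioner(nn.Module):
    """Minimal CNN encoder + GRU decoder with temporal attention (sketch)."""

    def __init__(self, n_mels=64, hidden=256, vocab_size=1000):
        super().__init__()
        # Convolutional encoder over the mel axis; the time axis is preserved
        # so the decoder can attend over it.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, spec, tokens):
        # spec: (batch, n_mels, time), tokens: (batch, seq) caption prefixes
        audio = self.encoder(spec).transpose(1, 2)      # (batch, time, hidden)
        words = self.embed(tokens)                      # (batch, seq, hidden)
        # Each decoding step attends over the audio time steps.
        context, _ = self.attn(words, audio, audio)     # (batch, seq, hidden)
        hidden_states, _ = self.decoder(torch.cat([words, context], dim=-1))
        return self.out(hidden_states)                  # next-token logits

model = TinyAudioCaptioner()
logits = model(torch.randn(2, 64, 400), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # (2, 12, 1000)
```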
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
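The Music Gesture entry describes a two-stage pipeline: a graph network over body keypoints followed by audio-visual fusion. The sketch below mirrors that structure at toy scale; the fully-connected adjacency, the mask-based separation head, and the assumption that keypoint frames and spectrogram frames are time-aligned are all simplifications, not the paper's design.

```python
import torch
import torch.nn as nn

class KeypointGraphNet(nn.Module):
    """Tiny graph-style network over body keypoints (illustrative only)."""

    def __init__(self, n_joints=15, feat=2, hidden=64):
        super().__init__()
        # Fixed fully-connected adjacency as a stand-in for a skeleton graph.
        self.register_buffer("adj", torch.ones(n_joints, n_joints) / n_joints)
        self.lin1 = nn.Linear(feat, hidden)
        self.lin2 = nn.Linear(hidden, hidden)

    def forward(self, joints):
        # joints: (batch, time, n_joints, feat) 2-D keypoint coordinates
        x = torch.relu(self.lin1(joints))
        x = torch.relu(self.lin2(self.adj @ x))   # mix information across joints
        return x.mean(dim=2)                      # (batch, time, hidden) body dynamics

class GestureAudioFusion(nn.Module):
    """Fuse gesture features with a mixture spectrogram to predict a
    separation mask -- a sketch of the audio-visual fusion stage, not the
    paper's exact model."""

    def __init__(self, hidden=64, n_freq=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)
        self.mask_head = nn.Linear(2 * hidden, n_freq)

    def forward(self, gesture, spec):
        # gesture: (batch, time, hidden), spec: (batch, time, n_freq)
        # Assumes gesture and spectrogram frames are time-aligned.
        a = torch.relu(self.audio_proj(spec))
        fused = torch.cat([gesture, a], dim=-1)
        return torch.sigmoid(self.mask_head(fused))  # per-frame spectrogram mask

graph_net, fusion = KeypointGraphNet(), GestureAudioFusion()
joints = torch.randn(2, 100, 15, 2)     # batch of keypoint sequences
spec = torch.rand(2, 100, 256)          # mixture spectrogram frames
mask = fusion(graph_net(joints), spec)
print(mask.shape)                       # (2, 100, 256)
```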
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.