Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
- URL: http://arxiv.org/abs/2208.11671v1
- Date: Wed, 24 Aug 2022 17:07:37 GMT
- Title: Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
- Authors: Yixiao Zhang, Junyan Jiang, Gus Xia, Simon Dixon
- Abstract summary: BART-fusion is a novel model for generating lyric interpretations from lyrics and music audio.
We employ a cross-modal attention module to incorporate the audio representation into the lyrics representation.
We show that the additional audio information helps our model to understand words and music better.
- Score: 12.19432397758502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lyric interpretations can help people understand songs and their lyrics
quickly, and can also make it easier to manage, retrieve and discover songs
efficiently from the growing mass of music archives. In this paper we propose
BART-fusion, a novel model for generating lyric interpretations from lyrics and
music audio that combines a large-scale pre-trained language model with an
audio encoder. We employ a cross-modal attention module to incorporate the
audio representation into the lyrics representation to help the pre-trained
language model understand the song from an audio perspective, while preserving
the language model's original generative performance. We also release the Song
Interpretation Dataset, a new large-scale dataset for training and evaluating
our model. Experimental results show that the additional audio information
helps our model to understand words and music better, and to generate precise
and fluent interpretations. An additional experiment on cross-modal music
retrieval shows that interpretations generated by BART-fusion can also help
people retrieve music more accurately than with the original BART.
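The core mechanism is the cross-modal attention module that injects the audio representation into the lyric representation while leaving the language model's generative pathway intact. Below is a minimal, illustrative PyTorch sketch of such a fusion layer; the dimensions, names, and residual design are assumptions for illustration, not the authors' exact implementation.
```python
# Illustrative cross-modal attention fusion layer (not BART-fusion's
# exact code): lyric token states query audio frame states.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_text=768, d_audio=512, n_heads=8):
        super().__init__()
        # Project audio frames into the text model's hidden size.
        self.audio_proj = nn.Linear(d_audio, d_text)
        self.attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_states, audio_states):
        # text_states: (batch, n_tokens, d_text) from the BART encoder
        # audio_states: (batch, n_frames, d_audio) from the audio encoder
        audio = self.audio_proj(audio_states)
        # Lyrics attend over audio; the residual connection preserves the
        # original language-model representation.
        fused, _ = self.attn(query=text_states, key=audio, value=audio)
        return self.norm(text_states + fused)

fusion = CrossModalFusion()
text = torch.randn(2, 128, 768)   # e.g. BART-base hidden states
audio = torch.randn(2, 300, 512)  # e.g. spectrogram encoder output
print(fusion(text, audio).shape)  # torch.Size([2, 128, 768])
```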
Related papers
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
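As a rough illustration of the multimodal conditioning MeLFusion describes, the hypothetical sketch below pools text and image embeddings into a single conditioning vector for a music generator; MeLFusion's actual conditioning mechanism is not reproduced here.
```python
# Hypothetical sketch of fusing text and image cues into one conditioning
# vector for a music diffusion model (not MeLFusion's actual mechanism).
import torch
import torch.nn as nn

class MultimodalCondition(nn.Module):
    def __init__(self, d_text=512, d_image=512, d_cond=768):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_text + d_image, d_cond),
            nn.GELU(),
            nn.Linear(d_cond, d_cond),
        )

    def forward(self, text_emb, image_emb):
        # text_emb, image_emb: (batch, d) pooled embeddings, e.g. from
        # CLIP-style encoders; the fused vector conditions the denoiser.
        return self.fuse(torch.cat([text_emb, image_emb], dim=-1))

cond = MultimodalCondition()(torch.randn(4, 512), torch.randn(4, 512))
print(cond.shape)  # torch.Size([4, 768])
```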
- SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation [88.33522730306674]
SongComposer can understand and generate melodies and lyrics in symbolic song representations.
We resort to symbolic song representation, the mature and efficient format humans designed for music.
With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation.
arXiv Detail & Related papers (2024-02-27T16:15:28Z)
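SongComposer's exact tuple encoding is defined in its paper; the snippet below is a generic, hypothetical illustration of the kind of lyric-aligned symbolic melody representation such models consume.
```python
# A generic symbolic song representation: each event aligns one lyric
# syllable with a note (MIDI pitch, onset and duration in beats).
# Illustrative format only, not SongComposer's exact encoding.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    syllable: str    # lyric syllable sung on this note
    pitch: int       # MIDI note number (60 = middle C)
    onset: float     # start time in beats
    duration: float  # length in beats

melody = [
    NoteEvent("Twin", 60, 0.0, 1.0),
    NoteEvent("kle", 60, 1.0, 1.0),
    NoteEvent("twin", 67, 2.0, 1.0),
    NoteEvent("kle", 67, 3.0, 1.0),
]

# Serialise to a token-like string a language model could consume.
tokens = " | ".join(f"{n.syllable}:{n.pitch}:{n.onset}:{n.duration}"
                    for n in melody)
print(tokens)
```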
- The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation [18.984512029792235]
We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs.
The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings.
arXiv Detail & Related papers (2023-11-16T17:52:21Z)
- LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT [48.28624219567131]
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method.
We use Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model.
Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English.
arXiv Detail & Related papers (2023-06-29T17:01:51Z)
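A condensed sketch of a Whisper-plus-chat-LLM transcription pipeline in the spirit of LyricWhiz; the prompt, model choices, and post-processing below are placeholders rather than the paper's actual setup.
```python
# Sketch of a two-stage lyrics transcription pipeline: weakly supervised
# ASR followed by LLM clean-up (prompt and model names are placeholders).
import whisper
from openai import OpenAI

def transcribe_lyrics(audio_path: str) -> str:
    # Stage 1: Whisper produces a raw, possibly noisy transcript.
    asr = whisper.load_model("large")
    raw = asr.transcribe(audio_path)["text"]

    # Stage 2: a chat LLM filters ASR noise and formats the lyric lines.
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Clean up this noisy song transcript and return "
                       f"only the lyric lines:\n{raw}",
        }],
    )
    return reply.choices[0].message.content

# Example usage: print(transcribe_lyrics("song.mp3"))
```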
- Language-Guided Music Recommendation for Video via Prompt Analogies [35.48998901411509]
We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language.
Existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
arXiv Detail & Related papers (2023-06-15T17:58:01Z)
- Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
arXiv Detail & Related papers (2023-05-30T17:20:25Z)
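To make the idea of compiling melody into decoding constraints concrete, the toy sketch below filters candidate lyric lines by syllable count against the number of notes in a phrase; it is a crude stand-in, not the authors' constraint compiler.
```python
# Toy melody-as-decoding-constraint: keep only candidate lyric lines
# whose syllable count matches the melodic phrase length.
import re

VOWELS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def syllable_count(line: str) -> int:
    # Crude heuristic: count vowel groups per word.
    return sum(len(VOWELS.findall(word)) for word in line.split())

def filter_candidates(candidates: list[str], n_notes: int) -> list[str]:
    return [c for c in candidates if syllable_count(c) == n_notes]

print(filter_candidates(
    ["shining in the night", "stars above"], n_notes=5))
# -> ['shining in the night']
```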
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
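A minimal sketch of the text-embedding side of this recipe, using a publicly available T5 encoder from Hugging Face transformers; Noise2Music's actual language model and pooling are not reproduced here.
```python
# Sketch: extract prompt embeddings with a pretrained T5 encoder, the
# kind of text conditioning a text-to-music diffusion model can ingest.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

prompt = "an upbeat 90s rock track with distorted electric guitar"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, n_tokens, d_model)
cond = hidden.mean(dim=1)  # simple mean pooling into one vector
print(cond.shape)          # torch.Size([1, 512])
```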
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
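The reported metrics are BLEU and edit-distance similarity over generated token sequences; the snippet below illustrates both on made-up symbolic-music tokens.
```python
# Minimal illustration of the two reported metrics: BLEU over token
# sequences and an edit-distance-based similarity ratio.
from difflib import SequenceMatcher
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "C4 E4 G4 C5 rest G4 E4 C4".split()
hypothesis = "C4 E4 G4 C5 rest G4 C4 C4".split()

bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
edit_sim = SequenceMatcher(None, reference, hypothesis).ratio()

print(f"BLEU: {bleu:.3f}, edit similarity: {edit_sim:.3f}")
```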
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
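A compact sketch of the symmetric contrastive (InfoNCE-style) objective behind such dual-encoder training; the encoder architectures and any MusCALL-specific weighting are omitted.
```python
# Symmetric contrastive (InfoNCE/CLIP-style) loss over paired audio and
# text embeddings, as used in dual-encoder alignment learning.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, d); row i of each is a matched pair.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(a))   # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```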
- The Contribution of Lyrics and Acoustics to Collaborative Understanding of Mood [7.426508199697412]
We study the association between song lyrics and mood through a data-driven analysis.
Our data set consists of nearly one million songs, with song-mood associations derived from user playlists on the Spotify streaming platform.
We take advantage of state-of-the-art natural language processing models based on transformers to learn the association between the lyrics and moods.
arXiv Detail & Related papers (2022-05-31T19:58:41Z)
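A hypothetical sketch of the transformer-based lyrics-to-mood setup: a pretrained sequence classifier over lyric text. The model name and mood labels are placeholders, and the classification head is untrained here, so the printed label is arbitrary until fine-tuned on (lyrics, mood) pairs.
```python
# Hypothetical sketch: a pretrained transformer classifier over lyrics
# for mood labels (model name and label set are placeholders).
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer)

MOODS = ["happy", "sad", "energetic", "calm"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(MOODS))

inputs = tokenizer("rain keeps falling on my empty street",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(MOODS[logits.argmax(-1).item()])  # untrained head: arbitrary label
```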
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
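A generic linear-probe protocol for comparing frozen audio representations from differently pre-trained backbones, in the spirit of this evaluation; all data and names below are placeholders.
```python
# Generic linear-probe comparison of frozen audio embeddings produced
# by different pre-training strategies (placeholder data throughout).
import torch
import torch.nn as nn

def linear_probe(features, labels, n_classes, epochs=50):
    # features: (n, d) frozen embeddings; labels: (n,) class ids.
    probe = nn.Linear(features.shape[1], n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(features), labels)
        loss.backward()
        opt.step()
    # Toy sketch: reports accuracy on the probe's own training set.
    return (probe(features).argmax(-1) == labels).float().mean().item()

# Same downstream task, two pre-training strategies:
feats_mulap = torch.randn(100, 512)       # placeholder embeddings
feats_audio_only = torch.randn(100, 512)  # placeholder embeddings
labels = torch.randint(0, 10, (100,))
print(linear_probe(feats_mulap, labels, 10),
      linear_probe(feats_audio_only, labels, 10))
```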
This list is automatically generated from the titles and abstracts of the papers on this site.