MuLan: A Joint Embedding of Music Audio and Natural Language
- URL: http://arxiv.org/abs/2208.12415v1
- Date: Fri, 26 Aug 2022 03:13:21 GMT
- Title: MuLan: A Joint Embedding of Music Audio and Natural Language
- Authors: Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li,
Daniel P. W. Ellis
- Abstract summary: This paper presents MuLan, a model that links music audio directly to unconstrained natural language descriptions.
MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings.
- Score: 15.753767984842014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music tagging and content-based retrieval systems have traditionally been
constructed using pre-defined ontologies covering a rigid set of music
attributes or text queries. This paper presents MuLan: a first attempt at a new
generation of acoustic models that link music audio directly to unconstrained
natural language music descriptions. MuLan takes the form of a two-tower, joint
audio-text embedding model trained using 44 million music recordings (370K
hours) and weakly-associated, free-form text annotations. Through its
compatibility with a wide range of music genres and text styles (including
conventional music tags), the resulting audio-text representation subsumes
existing ontologies while graduating to true zero-shot functionalities. We
demonstrate the versatility of the MuLan embeddings with a range of experiments
including transfer learning, zero-shot music tagging, language understanding in
the music domain, and cross-modal retrieval applications.
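The abstract describes a two-tower model that embeds music audio and free-form text into a shared space. As a concrete illustration of that general setup (not MuLan's actual implementation), the sketch below pairs a placeholder audio tower and text tower, trains them with a symmetric contrastive loss over matched (audio, text) pairs, and scores free-form tags against a clip by cosine similarity; every encoder, dimension, and loss detail here is an illustrative assumption.

```python
# Minimal sketch of a two-tower audio-text embedding model trained with a
# contrastive objective, in the spirit of the MuLan abstract above.
# Module names, dimensions, and the exact loss form are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoTowerEmbedder(nn.Module):
    def __init__(self, audio_feat_dim=128, text_vocab=30000, embed_dim=128):
        super().__init__()
        # Audio tower: stands in for a spectrogram encoder; here a small MLP over pooled features.
        self.audio_tower = nn.Sequential(
            nn.Linear(audio_feat_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )
        # Text tower: stands in for a pretrained language encoder; here a bag-of-embeddings.
        self.text_embed = nn.EmbeddingBag(text_vocab, embed_dim)
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def embed_audio(self, audio_feats):            # (B, audio_feat_dim)
        return F.normalize(self.audio_tower(audio_feats), dim=-1)

    def embed_text(self, token_ids):               # (B, T) integer token ids
        return F.normalize(self.text_embed(token_ids), dim=-1)

    def contrastive_loss(self, audio_feats, token_ids):
        a = self.embed_audio(audio_feats)          # (B, D), unit norm
        t = self.embed_text(token_ids)             # (B, D), unit norm
        logits = a @ t.t() / self.temperature      # (B, B) similarity matrix
        targets = torch.arange(a.size(0))
        # Symmetric InfoNCE-style loss: each clip should match its own text and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


def zero_shot_tag_scores(model, audio_feats, tag_token_ids):
    """Score one clip against free-form tag texts by cosine similarity in the shared space."""
    with torch.no_grad():
        a = model.embed_audio(audio_feats)         # (1, D)
        t = model.embed_text(tag_token_ids)        # (num_tags, D)
        return (a @ t.t()).squeeze(0)              # (num_tags,) similarity per tag


if __name__ == "__main__":
    model = TwoTowerEmbedder()
    audio = torch.randn(8, 128)                    # a batch of pooled audio features
    text = torch.randint(0, 30000, (8, 16))        # matching tokenized descriptions
    loss = model.contrastive_loss(audio, text)
    loss.backward()
    print("contrastive loss:", loss.item())
```

The `zero_shot_tag_scores` helper reflects how a joint embedding supports zero-shot tagging and cross-modal retrieval: any tag, description, or query is embedded by the text tower and compared with audio embeddings directly, with no classifier head tied to a fixed ontology.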
Related papers
- CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models [51.03510073676228]
CLaMP 2 is a music information retrieval system compatible with 101 languages.
By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale.
CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities.
arXiv Detail & Related papers (2024-10-17T06:43:54Z)
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models [11.834712543531756]
MuChoMusic is a benchmark for evaluating music understanding in multimodal language models focused on audio.
It comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets.
We evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality.
arXiv Detail & Related papers (2024-08-02T15:34:05Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Content-based Controls For Music Large Language Modeling [6.17674772485321]
Coco-Mulla is a content-based control method for music large language modeling.
It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models.
Our approach achieves high-quality music generation with low-resource semi-supervised learning.
arXiv Detail & Related papers (2023-10-26T05:24:38Z)
- Language-Guided Music Recommendation for Video via Prompt Analogies [35.48998901411509]
We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language.
Existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
arXiv Detail & Related papers (2023-06-15T17:58:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- Music-to-Text Synaesthesia: Generating Descriptive Text from Music Recordings [36.090928638883454]
Music-to-text synaesthesia aims to generate descriptive texts from music recordings with the same sentiment for further understanding.
We build a computational model to generate sentences that can describe the content of the music recording.
To tackle the highly non-discriminative classical music, we design a group topology-preservation loss.
arXiv Detail & Related papers (2022-10-02T06:06:55Z)
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
- Genre-conditioned Acoustic Models for Automatic Lyrics Transcription of Polyphonic Music [73.73045854068384]
We propose to transcribe the lyrics of polyphonic music using a novel genre-conditioned network.
The proposed network adopts pre-trained model parameters, and incorporates the genre adapters between layers to capture different genre peculiarities for lyrics-genre pairs.
Our experiments show that the proposed genre-conditioned network outperforms the existing lyrics transcription systems.
arXiv Detail & Related papers (2022-04-07T09:15:46Z)