JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata
- URL: http://arxiv.org/abs/2502.07461v1
- Date: Tue, 11 Feb 2025 11:12:19 GMT
- Title: JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata
- Authors: Abhinaba Roy, Renhang Liu, Tongyu Lu, Dorien Herremans
- Abstract summary: We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform.
The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata.
- Abstract: We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
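The retrieval-and-imputation approach described in the abstract can be pictured with a short sketch. The code below is a minimal illustration, not the authors' released implementation: the track records (precomputed `emb` vectors plus `meta` dicts), the 0.7/0.3 similarity weighting, and the prompt template are all assumptions.

```python
# Hypothetical sketch: rank candidate tracks by mixing audio-embedding
# similarity with metadata overlap, then build the context a local LLM
# would use to impute one missing field.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_similar(query: dict, corpus: list[dict],
                     k: int = 5, w_audio: float = 0.7, w_meta: float = 0.3):
    """Return the k corpus tracks most similar to `query` under a weighted
    mix of embedding similarity and shared-metadata agreement."""
    def meta_overlap(m1: dict, m2: dict) -> float:
        shared = set(m1) & set(m2)
        return sum(m1[key] == m2[key] for key in shared) / max(len(shared), 1)
    scored = [(w_audio * cosine(query["emb"], t["emb"])
               + w_meta * meta_overlap(query["meta"], t["meta"]), t)
              for t in corpus]
    return [t for _, t in sorted(scored, key=lambda s: -s[0])[:k]]

def imputation_prompt(track: dict, neighbours: list[dict], missing_field: str) -> str:
    """Context handed to a local LLM to infer one missing metadata field."""
    context = "\n".join(str(n["meta"]) for n in neighbours)
    return (f"Known metadata: {track['meta']}\n"
            f"Metadata of similar tracks:\n{context}\n"
            f"Infer a plausible value for '{missing_field}':")
```

The local LLM itself is left abstract here; any locally hosted model that maps a prompt string to text could consume the output of `imputation_prompt`.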
Related papers
- Can Impressions of Music be Extracted from Thumbnail Images? [20.605634973566573]
There is a scarcity of large-scale publicly available datasets consisting of music data and their corresponding natural language descriptions known as music captions.
We propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images (a sketch follows this entry).
We created a dataset with approximately 360,000 captions containing non-musical aspects and trained a music retrieval model.
arXiv Detail & Related papers (2025-01-05T11:51:38Z)
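A minimal sketch of the thumbnail idea above, assuming an off-the-shelf vision-language captioner from Hugging Face `transformers`; the model choice and the merge template are illustrative, not the authors' actual pipeline.

```python
# Caption a track's thumbnail with a generic image-to-text model, then merge
# the non-musical description with existing music tags. Hypothetical setup.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def thumbnail_caption(image_path: str, music_tags: list[str]) -> str:
    visual = captioner(image_path)[0]["generated_text"]  # non-musical aspects
    return f"{', '.join(music_tags)} track; the artwork suggests {visual}"
```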
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal-language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing [3.3162176082220975]
We present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamics, articulation, and harmony for 742 professional music performances by 23 musicians.
To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date.
arXiv Detail & Related papers (2024-06-10T15:37:46Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenge of misaligned measures across tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation); a measure-level sketch follows this entry.
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
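The toy function below illustrates the synchronisation idea named above: interleave corresponding measures from several tracks so a language model sees aligned bars together rather than whole tracks in sequence. The bar-splitting and `[voice]` tagging are illustrative, not the actual SMT-ABC specification.

```python
# Illustrative measure-level synchronisation (not the real SMT-ABC spec):
# split each track's ABC body into bars and emit bar i of every voice together.
def synchronise_tracks(tracks: dict[str, str]) -> list[str]:
    bars = {name: [b.strip() for b in body.split("|") if b.strip()]
            for name, body in tracks.items()}
    n_bars = max(len(b) for b in bars.values())
    groups = []
    for i in range(n_bars):
        group = " ".join(f"[{name}] {b[i]}"
                         for name, b in bars.items() if i < len(b))
        groups.append(group + " |")
    return groups

# Two toy voices, two measures each -> two synchronised bar groups.
print(synchronise_tracks({"V1": "C D E F | G A B c", "V2": "C,2 E,2 | G,2 C2"}))
```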
- WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music.
The data is sourced from Wikipedia's rich catalogue of articles covering musical works.
We train a model that jointly learns text and audio representations and performs cross-modal retrieval (a generic contrastive sketch follows this entry).
arXiv Detail & Related papers (2023-12-14T18:38:02Z)
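Joint text-audio representation learning with cross-modal retrieval is typically trained with a symmetric contrastive (InfoNCE) objective. The sketch below shows that generic loss; it is not WikiMuTe's exact training code.

```python
# Generic symmetric InfoNCE loss over a batch of paired audio/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched audio-text pairs sit on the diagonal; push them above all others
    # in both retrieval directions (audio-to-text and text-to-audio).
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```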
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response [42.73982391253872]
MusiLingo is a novel system for music caption generation and music-related query responses.
We train it on an extensive music caption dataset and fine-tune it with instructional data.
Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
arXiv Detail & Related papers (2023-09-15T19:31:40Z)
- Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning [37.76488341368786]
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions.
We propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files.
We present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA dataset (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2023-08-22T08:43:33Z)
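A hedged sketch of caption-to-Q&A conversion in the spirit of the MusicQA construction mentioned above; the prompt wording and the `ask_llm` callable are hypothetical stand-ins for whatever LLM interface is available.

```python
# Turn one audio caption into question-answer pairs via an LLM prompt.
def qa_prompt(caption: str, n_pairs: int = 3) -> str:
    return ("Based only on this music description, write "
            f"{n_pairs} question-answer pairs about the music.\n"
            f"Description: {caption}\n"
            "Format each pair as 'Q: ...' on one line and 'A: ...' on the next.")

def caption_to_qa(caption: str, ask_llm) -> list[tuple[str, str]]:
    """`ask_llm` is any callable mapping a prompt string to model output text."""
    lines = [l.strip() for l in ask_llm(qa_prompt(caption)).splitlines() if l.strip()]
    questions = [l[2:].strip() for l in lines if l.startswith("Q:")]
    answers = [l[2:].strip() for l in lines if l.startswith("A:")]
    return list(zip(questions, answers))
```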
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of representations from all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically (a toy filtering stage is sketched after this entry).
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
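As a toy illustration of what a rule-based first stage might look like before an LLM rewrites the surviving descriptions, the snippet below drops entries that are too short or long, contain URLs, or look non-English. The thresholds are assumptions, not WavCaps' actual rules.

```python
# Crude first-pass filter for harvested audio descriptions (illustrative only).
import re

def keep_description(text: str, min_words: int = 4, max_words: int = 60) -> bool:
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if re.search(r"https?://", text):  # links rarely describe the sound itself
        return False
    ascii_ratio = sum(c.isascii() for c in text) / max(len(text), 1)
    return ascii_ratio > 0.9           # crude English-only heuristic

raw = ["Recorded w/ Zoom H4n, see https://example.com",
       "Gentle rain falling on a tin roof"]
clean = [t for t in raw if keep_description(t)]  # keeps only the second entry
```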
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)