JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata
- URL: http://arxiv.org/abs/2502.07461v1
- Date: Tue, 11 Feb 2025 11:12:19 GMT
- Title: JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata
- Authors: Abhinaba Roy, Renhang Liu, Tongyu Lu, Dorien Herremans
- Abstract summary: We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform.
The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata.
- Abstract: We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
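The retrieval-and-imputation approach described in the abstract can be pictured with a short sketch. The code below is a minimal illustration, not the authors' released implementation: the track records (precomputed `emb` vectors plus `meta` dicts), the 0.7/0.3 similarity weighting, and the prompt template are all assumptions.

```python
# Hypothetical sketch: rank candidate tracks by mixing audio-embedding
# similarity with metadata overlap, then build the context a local LLM
# would use to impute one missing field.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_similar(query: dict, corpus: list[dict],
                     k: int = 5, w_audio: float = 0.7, w_meta: float = 0.3):
    """Return the k corpus tracks most similar to `query` under a weighted
    mix of embedding similarity and shared-metadata agreement."""
    def meta_overlap(m1: dict, m2: dict) -> float:
        shared = set(m1) & set(m2)
        return sum(m1[key] == m2[key] for key in shared) / max(len(shared), 1)
    scored = [(w_audio * cosine(query["emb"], t["emb"])
               + w_meta * meta_overlap(query["meta"], t["meta"]), t)
              for t in corpus]
    return [t for _, t in sorted(scored, key=lambda s: -s[0])[:k]]

def imputation_prompt(track: dict, neighbours: list[dict], missing_field: str) -> str:
    """Context handed to a local LLM to infer one missing metadata field."""
    context = "\n".join(str(n["meta"]) for n in neighbours)
    return (f"Known metadata: {track['meta']}\n"
            f"Metadata of similar tracks:\n{context}\n"
            f"Infer a plausible value for '{missing_field}':")
```

The local LLM itself is left abstract here; any locally hosted model that maps a prompt string to text could consume the output of `imputation_prompt`.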
Related papers
- Can Impressions of Music be Extracted from Thumbnail Images? [20.605634973566573]
There is a scarcity of large-scale publicly available datasets consisting of music data and their corresponding natural language descriptions known as music captions.
We propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images (a sketch follows this entry).
We created a dataset with approximately 360,000 captions containing non-musical aspects and trained a music retrieval model.
arXiv Detail & Related papers (2025-01-05T11:51:38Z)
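A minimal sketch of the thumbnail idea above, assuming an off-the-shelf vision-language captioner from Hugging Face `transformers`; the model choice and the merge template are illustrative, not the authors' actual pipeline.

```python
# Caption a track's thumbnail with a generic image-to-text model, then merge
# the non-musical description with existing music tags. Hypothetical setup.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def thumbnail_caption(image_path: str, music_tags: list[str]) -> str:
    visual = captioner(image_path)[0]["generated_text"]  # non-musical aspects
    return f"{', '.join(music_tags)} track; the artwork suggests {visual}"
```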
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal-language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing [3.3162176082220975]
We present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamics, articulation, and harmony for 742 professional music performances by 23 musicians.
To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date.
arXiv Detail & Related papers (2024-06-10T15:37:46Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenge of misaligned measures across tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation); a measure-level sketch follows this entry.
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
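The toy function below illustrates the synchronisation idea named above: interleave corresponding measures from several tracks so a language model sees aligned bars together rather than whole tracks in sequence. The bar-splitting and `[voice]` tagging are illustrative, not the actual SMT-ABC specification.

```python
# Illustrative measure-level synchronisation (not the real SMT-ABC spec):
# split each track's ABC body into bars and emit bar i of every voice together.
def synchronise_tracks(tracks: dict[str, str]) -> list[str]:
    bars = {name: [b.strip() for b in body.split("|") if b.strip()]
            for name, body in tracks.items()}
    n_bars = max(len(b) for b in bars.values())
    groups = []
    for i in range(n_bars):
        group = " ".join(f"[{name}] {b[i]}"
                         for name, b in bars.items() if i < len(b))
        groups.append(group + " |")
    return groups

# Two toy voices, two measures each -> two synchronised bar groups.
print(synchronise_tracks({"V1": "C D E F | G A B c", "V2": "C,2 E,2 | G,2 C2"}))
```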
- WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music.
The data is sourced from Wikipedia's rich catalogue of articles covering musical works.
We train a model that jointly learns text and audio representations and performs cross-modal retrieval (a generic contrastive sketch follows this entry).
arXiv Detail & Related papers (2023-12-14T18:38:02Z)
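Joint text-audio representation learning with cross-modal retrieval is typically trained with a symmetric contrastive (InfoNCE) objective. The sketch below shows that generic loss; it is not WikiMuTe's exact training code.

```python
# Generic symmetric InfoNCE loss over a batch of paired audio/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched audio-text pairs sit on the diagonal; push them above all others
    # in both retrieval directions (audio-to-text and text-to-audio).
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```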
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response [42.73982391253872]
MusiLingo is a novel system for music caption generation and music-related query responses.
We train it on an extensive music caption dataset and fine-tune it with instructional data.
Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
arXiv Detail & Related papers (2023-09-15T19:31:40Z)
- Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning [37.76488341368786]
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions.
We propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files.
We present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA dataset (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2023-08-22T08:43:33Z)
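A hedged sketch of caption-to-Q&A conversion in the spirit of the MusicQA construction mentioned above; the prompt wording and the `ask_llm` callable are hypothetical stand-ins for whatever LLM interface is available.

```python
# Turn one audio caption into question-answer pairs via an LLM prompt.
def qa_prompt(caption: str, n_pairs: int = 3) -> str:
    return ("Based only on this music description, write "
            f"{n_pairs} question-answer pairs about the music.\n"
            f"Description: {caption}\n"
            "Format each pair as 'Q: ...' on one line and 'A: ...' on the next.")

def caption_to_qa(caption: str, ask_llm) -> list[tuple[str, str]]:
    """`ask_llm` is any callable mapping a prompt string to model output text."""
    lines = [l.strip() for l in ask_llm(qa_prompt(caption)).splitlines() if l.strip()]
    questions = [l[2:].strip() for l in lines if l.startswith("Q:")]
    answers = [l[2:].strip() for l in lines if l.startswith("A:")]
    return list(zip(questions, answers))
```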
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of representations from all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically (a toy filtering stage is sketched after this entry).
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
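As a toy illustration of what a rule-based first stage might look like before an LLM rewrites the surviving descriptions, the snippet below drops entries that are too short or long, contain URLs, or look non-English. The thresholds are assumptions, not WavCaps' actual rules.

```python
# Crude first-pass filter for harvested audio descriptions (illustrative only).
import re

def keep_description(text: str, min_words: int = 4, max_words: int = 60) -> bool:
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if re.search(r"https?://", text):  # links rarely describe the sound itself
        return False
    ascii_ratio = sum(c.isascii() for c in text) / max(len(text), 1)
    return ascii_ratio > 0.9           # crude English-only heuristic

raw = ["Recorded w/ Zoom H4n, see https://example.com",
       "Gentle rain falling on a tin roof"]
clean = [t for t in raw if keep_description(t)]  # keeps only the second entry
```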
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)