CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining
- URL: http://arxiv.org/abs/2503.23128v1
- Date: Sat, 29 Mar 2025 15:43:09 GMT
- Title: CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining
- Authors: Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, Simon Lui,
- Abstract summary: This paper presents a novel cross-modal contrastive learning framework to guide music similarity modeling.<n>To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach.<n>Experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks.
- Score: 15.58671300364536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Exten1sive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.
Related papers
- Cross-Modal Learning for Music-to-Music-Video Description Generation [22.27153318775917]
Music-to-music-video (MV) generation is a challenging task due to intrinsic differences between the music and video modalities.<n>In this study, we focus on the MV description generation task and propose a comprehensive pipeline.<n>We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset.
arXiv Detail & Related papers (2025-03-14T08:34:28Z) - Enriching Music Descriptions with a Finetuned-LLM and Metadata for Text-to-Music Retrieval [7.7464988473650935]
Text-to-Music Retrieval plays a pivotal role in content discovery within extensive music databases.
This paper proposes an improved Text-to-Music Retrieval model, denoted as TTMR++.
arXiv Detail & Related papers (2024-10-04T09:33:34Z) - Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text.
This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content.
We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
arXiv Detail & Related papers (2024-08-14T01:24:09Z) - Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z) - T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.
We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation)
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response [42.73982391253872]
MusiLingo is a novel system for music caption generation and music-related query responses.
We train it on an extensive music caption dataset and fine-tune it with instructional data.
Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
arXiv Detail & Related papers (2023-09-15T19:31:40Z) - MusicLDM: Enhancing Novelty in Text-to-Music Generation Using
Beat-Synchronous Mixup Strategies [32.482588500419006]
We build a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain.
We propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup.
In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music.
arXiv Detail & Related papers (2023-08-03T05:35:37Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z) - Late multimodal fusion for image and audio music transcription [0.0]
multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by image and audio modalities.
We study four combination approaches in order to merge, for the first time, the hypotheses regarding end-to-end OMR and AMT systems.
Two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
arXiv Detail & Related papers (2022-04-06T20:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.