MuCPT: Music-related Natural Language Model Continued Pretraining
- URL: http://arxiv.org/abs/2511.14245v1
- Date: Tue, 18 Nov 2025 08:33:34 GMT
- Title: MuCPT: Music-related Natural Language Model Continued Pretraining
- Authors: Kai Tian, Yirong Mao, Wendong Bi, Hanjie Wang, Que Wenhui
- Abstract summary: We build a large, music-related natural language corpus (40B tokens) that combines open-source and in-house data. We also introduce reference-model (RM)-based token-level soft scoring for quality control. Overall, this work advances both the right corpus and the right objective, offering a scalable data-training framework.
- Score: 2.2288022262475873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open-source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noisy gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons along the axis of data composition. Overall, this work advances both the right corpus and the right objective, offering a scalable data-training framework and a reusable evaluation tool for building domain LLMs in the music field.
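The abstract's loss-ratio criterion can be pictured with a minimal sketch: compare each token's training loss against a frozen reference model's loss, drop tokens whose ratio marks them as likely noise, and softly down-weight the rest. This is an illustrative assumption, not the paper's actual implementation; the function names, the drop threshold, and the exact weighting rule are invented here for clarity.

```python
def token_weights(policy_losses, ref_losses, drop_ratio=1.5):
    """Soft-score tokens by comparing per-token losses of the model being
    trained against a frozen reference model (hypothetical rule)."""
    weights = []
    for lp, lr in zip(policy_losses, ref_losses):
        ratio = lp / max(lr, 1e-8)  # >1 means harder for the trained model than for the reference
        if ratio > drop_ratio:
            weights.append(0.0)  # likely noisy token: exclude from the update
        else:
            weights.append(min(1.0, 1.0 / ratio))  # dynamically down-weight, capped at 1
    return weights


def weighted_loss(policy_losses, ref_losses):
    """Aggregate per-token losses using the soft weights."""
    w = token_weights(policy_losses, ref_losses)
    return sum(wi * li for wi, li in zip(w, policy_losses)) / max(sum(w), 1e-8)
```

For example, with per-token losses `[2.0, 1.0, 6.0]` against reference losses `[2.0, 2.0, 2.0]`, the third token's ratio of 3.0 exceeds the threshold and is dropped, so only the first two tokens contribute to the averaged loss. The same ratio could equally serve document-level data selection by averaging it over a document's tokens, which is how the abstract describes using one criterion for both selection and down-weighting.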
Related papers
- RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation [78.01030342481246]
RecBase is a domain-agnostic foundational model pretrained with a recommendation-oriented objective. We introduce a unified item tokenizer that encodes items into hierarchical concept identifiers. Our model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks.
arXiv Detail & Related papers (2025-09-03T08:33:43Z) - MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation [6.903890310699392]
MusT-RAG is a comprehensive framework based on Retrieval Augmented Generation (RAG). MusWikiDB is a music-specialized vector database for the retrieval stage. Our experiments demonstrate that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music-domain adaptation capabilities.
arXiv Detail & Related papers (2025-07-31T08:31:05Z) - Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning [10.558648773612191]
We propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity hierarchically, from the semantic perspective to the user perspective. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training.
arXiv Detail & Related papers (2025-05-29T09:50:07Z) - CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining [15.58671300364536]
This paper presents a novel cross-modal contrastive learning framework to guide music similarity modeling. To overcome the scarcity of high-quality text-music paired data, it introduces a dual-source data acquisition approach. Experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks.
arXiv Detail & Related papers (2025-03-29T15:43:09Z) - Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation [72.28364940168092]
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation.
arXiv Detail & Related papers (2025-03-27T17:59:58Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - In-depth analysis of music structure as a text network [7.735597173716555]
We focus on the fundamental elements of music and construct an evolutionary network from the perspective of music as a natural language.
We aim to comprehend the structural differences in music across different periods, enabling a more scientific exploration of music.
arXiv Detail & Related papers (2023-03-21T08:39:56Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - Towards Context-Aware Neural Performance-Score Synchronisation [2.0305676256390934]
Music synchronisation provides a way to navigate among multiple representations of music in a unified manner.
Traditional synchronisation methods compute alignment using knowledge-driven and performance analysis approaches.
This PhD furthers the development of performance-score synchronisation research by proposing data-driven, context-aware alignment approaches.
arXiv Detail & Related papers (2022-05-31T16:45:25Z) - Unified Instance and Knowledge Alignment Pretraining for Aspect-based Sentiment Analysis [96.53859361560505]
Aspect-based Sentiment Analysis (ABSA) aims to determine the sentiment polarity towards an aspect.
There always exists severe domain shift between the pretraining and downstream ABSA datasets.
We introduce a unified alignment pretraining framework into the vanilla pretrain-finetune pipeline.
arXiv Detail & Related papers (2021-10-26T04:03:45Z) - iFAN: Image-Instance Full Alignment Networks for Adaptive Object Detection [48.83883375118966]
iFAN aims to precisely align feature distributions on both image and instance levels.
It outperforms state-of-the-art methods with a boost of 10%+ AP over the source-only baseline.
arXiv Detail & Related papers (2020-03-09T13:27:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.