MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers
- URL: http://arxiv.org/abs/2511.04376v1
- Date: Thu, 06 Nov 2025 14:01:52 GMT
- Title: MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers
- Authors: Ali Boudaghi, Hadi Zare
- Abstract summary: MusRec is the first zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity.
- Score: 3.096755173613532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, the first zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.
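As a quick illustration of the rectified-flow mechanic the abstract leans on (a hedged sketch, not the authors' code; the toy network and all dimensions are invented): rectified flow trains a velocity field v(x_t, t) along the straight path x_t = (1 - t) x_0 + t x_1 between noise x_0 and data x_1, so sampling is plain Euler integration of dx/dt = v, and integrating the same field backwards inverts a real audio latent into noise, which is what makes editing real, externally produced music possible. Text conditioning is omitted for brevity; in a real system x would be a latent from an audio autoencoder and v a diffusion transformer conditioned on the prompt.

# Illustrative rectified-flow sampling and inversion (not the authors' code).
import torch
import torch.nn as nn

class TinyVelocityField(nn.Module):
    """Stand-in for a diffusion-transformer velocity network."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate the scalar time onto every latent vector.
        t_feat = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t_feat], dim=-1))

@torch.no_grad()
def integrate(v, x, t0: float, t1: float, steps: int = 50):
    """Euler-integrate dx/dt = v(x, t) from t0 to t1.

    t0=0 -> t1=1 turns noise into data; reversing the limits inverts a real
    latent back to noise, the first step of zero-shot editing of real music.
    """
    dt = (t1 - t0) / steps
    t = torch.tensor([t0])
    for _ in range(steps):
        x = x + dt * v(x, t)
        t = t + dt
    return x

v = TinyVelocityField()
x1_hat = integrate(v, torch.randn(4, 64), 0.0, 1.0)  # noise -> "audio latent"
x0_hat = integrate(v, x1_hat, 1.0, 0.0)              # inversion: latent -> noise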
Related papers
- MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation [30.88898550337434]
Music editing plays a vital role in modern music production, with applications in film, broadcasting, and game development. However, many existing works overlook evaluating their ability to preserve musical facets that should remain unchanged during editing. We introduce the first MCP (Music Context Preservation) evaluation benchmark, MuseCPBench, which covers four categories of musical facets.
arXiv Detail & Related papers (2025-12-16T17:44:56Z)
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing [81.28392925790568]
We introduce MotionEdit, a novel dataset for motion-centric image editing. MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted from continuous videos. We propose MotionNFT to compute motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion.
arXiv Detail & Related papers (2025-12-11T04:53:58Z)
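MotionEdit's summary mentions motion-alignment rewards computed from how well the flow between input and edited images matches the ground-truth motion, without giving the reward itself. Below is a minimal stand-in assuming both flow fields have already been estimated (say, with an off-the-shelf optical-flow model); the masked cosine similarity and the 0.5-pixel motion threshold are assumptions, not MotionNFT's definition.

# Hypothetical motion-alignment reward: cosine similarity between a predicted
# flow field and the ground-truth flow, averaged over pixels that move.
import torch

def motion_alignment_reward(flow_pred: torch.Tensor,
                            flow_gt: torch.Tensor,
                            eps: float = 1e-6) -> torch.Tensor:
    """flow_* have shape (H, W, 2): per-pixel (dx, dy) displacement."""
    # Per-pixel cosine similarity between the two displacement vectors.
    dot = (flow_pred * flow_gt).sum(dim=-1)
    norm = flow_pred.norm(dim=-1) * flow_gt.norm(dim=-1)
    cos = dot / (norm + eps)
    # Only score pixels where the ground truth actually moves.
    moving = flow_gt.norm(dim=-1) > 0.5
    return cos[moving].mean() if moving.any() else cos.mean()

reward = motion_alignment_reward(torch.randn(64, 64, 2), torch.randn(64, 64, 2))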
- O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing [88.93410369258203]
O-DisCo-Edit is a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm.
arXiv Detail & Related papers (2025-09-01T16:29:39Z)
- EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z)
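EditGen's summary names a Prompt-to-Prompt-like use of cross-attention. For readers unfamiliar with that trick, here is a minimal sketch of the core attention swap under assumed tensor shapes; EditGen's actual integration with an auto-regressive audio model will differ.

# Minimal Prompt-to-Prompt-style attention swap (illustrative only): when
# regenerating with an edited prompt, reuse the original prompt's
# cross-attention maps for every unchanged token so the structure stays put.
import torch

def edit_attention(attn_src: torch.Tensor,
                   attn_edit: torch.Tensor,
                   unchanged_tokens: torch.Tensor) -> torch.Tensor:
    """attn_*: (heads, query_len, text_len) cross-attention probabilities.
    unchanged_tokens: bool mask over text positions shared by both prompts."""
    out = attn_edit.clone()
    # Only the swapped/edited tokens keep their freshly computed attention.
    out[..., unchanged_tokens] = attn_src[..., unchanged_tokens]
    return out

heads, q_len, t_len = 8, 256, 12
mask = torch.ones(t_len, dtype=torch.bool)
mask[5] = False  # pretend text token 5 was edited
blended = edit_attention(torch.rand(heads, q_len, t_len),
                         torch.rand(heads, q_len, t_len), mask)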
- Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models [13.29289368130043]
In this paper, we explore applying existing text-to-music diffusion models to instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Our method requires no additional training of the text-to-music diffusion model and does not compromise generation speed.
arXiv Detail & Related papers (2025-06-18T15:01:25Z)
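Diff-TONE's summary centers on picking the right timestep. A hedged sketch of why that choice matters, using the standard DDPM forward re-noising step: re-noise the source latent to an intermediate t and denoise under the new instrument prompt; a larger t edits more aggressively, a smaller t preserves more content. The fixed t=600 below is illustrative only; Diff-TONE's contribution is choosing the timestep automatically.

# Re-noising a real track's latent to an intermediate timestep before
# denoising under the edited prompt. The choice of t is the whole game.
import torch

def renoise(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """DDPM forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a_t = alphas_cumprod[t]
    return a_t.sqrt() * x0 + (1 - a_t).sqrt() * torch.randn_like(x0)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(1, 8, 128)  # stand-in for an encoded audio latent
x_mid = renoise(x0, t=600, alphas_cumprod=alphas_cumprod)
# ...then run the pretrained sampler from t=600 down to 0 with the new prompt.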
- Not that Groove: Zero-Shot Symbolic Music Editing [4.897267974042842]
We are among the first to tackle symbolic music editing. We show that LLMs with zero-shot prompting can effectively edit drum grooves. The recipe for success is a creatively designed format that interfaces LLMs and music.
arXiv Detail & Related papers (2025-05-13T03:33:36Z)
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
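The summary does not define MuseBarControl's counterfactual loss, so the following is only a guess at its general flavor: the model should explain the music better under the true control signal than under a mismatched (counterfactual) one. The hinge form and the margin are assumptions, not the paper's formulation.

# Guessed-at counterfactual alignment loss: the true control should yield a
# lower negative log-likelihood than a shuffled control, by a margin.
import torch
import torch.nn.functional as F

def counterfactual_loss(logits_true: torch.Tensor,
                        logits_cf: torch.Tensor,
                        targets: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    """logits_*: (batch, seq, vocab) under true vs. mismatched controls."""
    nll_true = F.cross_entropy(logits_true.transpose(1, 2), targets,
                               reduction="none").mean(1)
    nll_cf = F.cross_entropy(logits_cf.transpose(1, 2), targets,
                             reduction="none").mean(1)
    # Hinge: true-control NLL should beat the counterfactual NLL by `margin`.
    return F.relu(margin + nll_true - nll_cf).mean()

B, S, V = 4, 32, 512
loss = counterfactual_loss(torch.randn(B, S, V), torch.randn(B, S, V),
                           torch.randint(0, V, (B, S)))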
- Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [24.6866990804501]
Instruct-MusicGen is a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions. Remarkably, Instruct-MusicGen introduces only 8% new parameters relative to the original MusicGen model and trains for only 5K steps.
arXiv Detail & Related papers (2024-05-28T17:27:20Z)
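The summary quantifies the extra parameters (8%) but not the mechanism. A LoRA-style low-rank adapter, sketched below, is one standard way to graft a small trainable fraction onto a frozen model; whether Instruct-MusicGen uses this exact scheme is not stated here.

# One standard way to add a small trainable fraction to a frozen transformer:
# a LoRA-style low-rank update on a pretrained linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as an exact no-op
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(1024, 1024), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # ~3% at rank 16 here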
- MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models [24.582948932985726]
This paper introduces a novel approach to the editing of music generated by text-to-music models.
Our method transforms text editing into latent space manipulation while adding an extra constraint to enforce consistency.
Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations.
arXiv Detail & Related papers (2024-02-09T04:34:08Z)
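To make "text editing as latent space manipulation" concrete, here is one heavily simplified reading (not MusicMagus's actual procedure): approximate a word swap such as piano -> guitar by shifting the prompt's conditioning embeddings along the difference of the two word embeddings. The summary's extra consistency constraint (e.g., keeping attention maps fixed during resampling) is omitted here.

# Toy "edit by embedding shift": move the conditioning sequence along the
# (target - source) word-embedding direction, then decode with the new
# conditioning. All shapes and names are invented for illustration.
import torch

def edit_prompt_embedding(prompt_emb: torch.Tensor,
                          src_word_emb: torch.Tensor,
                          tgt_word_emb: torch.Tensor,
                          strength: float = 1.0) -> torch.Tensor:
    """prompt_emb: (tokens, dim) conditioning sequence; word embs: (dim,)."""
    direction = tgt_word_emb - src_word_emb
    return prompt_emb + strength * direction  # broadcasts over all tokens

emb = edit_prompt_embedding(torch.randn(16, 768),
                            torch.randn(768), torch.randn(768))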
- GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations.
arXiv Detail & Related papers (2023-05-18T09:53:23Z)
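GETScore's layout, as described above, is easy to picture in code: a toy grid with invented token ids, tracks as rows and time as columns, where source-target generation amounts to masking the rows to be generated. This is a sketch of the data structure only, not GETMusic's actual tokenization.

# Toy GETScore-like 2D token grid: rows are tracks, columns are time steps,
# and a PAD id marks cells with no event. All ids below are made up.
import numpy as np

PAD = 0
tracks = ["melody", "bass", "drums"]
T = 8  # time steps (e.g., positions on a sixteenth-note grid)

score = np.full((len(tracks), T), PAD, dtype=np.int64)
score[0, [0, 2, 4, 6]] = [60, 62, 64, 65]   # melody: token ids for pitches
score[1, [0, 4]] = [36, 41]                 # bass hits on the strong beats
score[2, ::2] = 99                          # drums: a hi-hat token every 2 steps

# Source-target generation: keep some rows fixed (source tracks) and let the
# non-autoregressive model fill the masked rows (target tracks) in parallel.
mask = np.zeros_like(score, dtype=bool)
mask[1] = True                               # e.g., regenerate the bass track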
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.