Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models
- URL: http://arxiv.org/abs/2506.15530v1
- Date: Wed, 18 Jun 2025 15:01:25 GMT
- Title: Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models
- Authors: Teysir Baoueb, Xiaoyu Bie, Xi Wang, Gaël Richard
- Abstract summary: In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Our method does not require additional training of the text-to-music diffusion model, nor does it compromise the generation process's speed.
- Score: 13.29289368130043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Breakthroughs in text-to-music generation models are transforming the creative landscape, equipping musicians with innovative tools for composition and experimentation like never before. However, controlling the generation process to achieve a specific desired outcome remains a significant challenge. Even a minor change in the text prompt, combined with the same random seed, can drastically alter the generated piece. In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Based on the insight that the model first focuses on the overall structure or content of the audio, then adds instrument information, and finally refines the quality, we show that selecting a well-chosen intermediate timestep, identified through an instrument classifier, yields a balance between preserving the original piece's content and achieving the desired timbre. Our method does not require additional training of the text-to-music diffusion model, nor does it compromise the generation process's speed.
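The selection mechanism described in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical reading of the method rather than the authors' implementation: the model interface (`add_noise`, `denoise_step`), the classifier call, and the prompt-switching rule below are assumptions introduced purely for illustration.

```python
import torch

def edit_instrument(model, classifier, audio_latent, src_prompt, tgt_prompt,
                    num_steps=50, confidence_threshold=0.9):
    """Illustrative sketch of intermediate-timestep instrument editing.

    Hypothetical interfaces (not the Diff-TONE API): model.add_noise(latent, t)
    noises a latent up to step t, model.denoise_step(latent, t, prompt) performs
    one reverse step and also returns the current clean-signal estimate, and
    classifier(estimate) returns instrument probabilities for that estimate.
    """
    timesteps = torch.arange(num_steps - 1, -1, -1)  # num_steps-1, ..., 1, 0

    # Start the reverse process from a fully noised version of the input latent.
    latent = model.add_noise(audio_latent, timesteps[0])

    prompt, switched = src_prompt, False
    for t in timesteps:
        latent, x0_estimate = model.denoise_step(latent, t, prompt)

        # Once the classifier is confident about the instrument present in the
        # intermediate estimate, switch to the target prompt: later steps then
        # impose the new timbre while content fixed in earlier steps is kept.
        if not switched and classifier(x0_estimate).max() > confidence_threshold:
            prompt, switched = tgt_prompt, True

    return latent
```

The sketch mirrors the abstract's insight that the reverse process fixes overall content early and instrument information later: the classifier only has to flag the point where instrument identity becomes detectable, after which the remaining steps run with the target-instrument prompt.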
Related papers
- JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment [26.590667516155083]
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation. Recent open lyrics-to-song models have set an acceptable standard in automatic song generation for recreational use. Our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation.
arXiv Detail & Related papers (2025-07-28T14:34:02Z)
- EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z)
- Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation [49.062766449989525]
Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model.
arXiv Detail & Related papers (2025-07-07T10:46:07Z)
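For context, the Boomerang idea summarized above can be sketched generically as partial noising of an existing sample followed by ordinary denoising. The interfaces below are assumptions (a diffusers-style scheduler exposing `timesteps` and `add_noise`, plus a hypothetical `model.denoise_step`), not the paper's actual code.

```python
import torch

def boomerang_edit(model, scheduler, sample, prompt, strength=0.3):
    """Generic sketch of Boomerang-style sampling (illustrative only).

    Partially noise an existing sample to a fraction `strength` of the full
    noise schedule, then resume the pretrained reverse process from that
    point, yielding an output that stays close to the original.
    """
    num_steps = len(scheduler.timesteps)                 # ordered high -> low noise
    skip = min(int((1.0 - strength) * num_steps), num_steps - 1)
    t_start = scheduler.timesteps[skip]                  # noise level to "throw" to

    # Forward-noise the existing sample up to t_start.
    noisy = scheduler.add_noise(sample, torch.randn_like(sample), t_start)

    # Run only the remaining reverse steps back toward clean audio.
    latent = noisy
    for t in scheduler.timesteps[skip:]:
        latent = model.denoise_step(latent, t, prompt)
    return latent
```

Smaller `strength` values keep the output closer to the input example; larger values allow more deviation from it.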
- SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation. It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z)
- Subtractive Training for Music Stem Insertion using Latent Diffusion Models [35.91945598575059]
We present Subtractive Training, a method for synthesizing individual musical instrument stems given other instruments as context. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
arXiv Detail & Related papers (2024-06-27T16:59:14Z)
- Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [24.6866990804501]
Instruct-MusicGen is a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions.
Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps.
arXiv Detail & Related papers (2024-05-28T17:27:20Z)
- MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models [24.582948932985726]
This paper introduces a novel approach to the editing of music generated by text-to-music models.
Our method transforms text editing to latent space manipulation while adding an extra constraint to enforce consistency.
Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations.
arXiv Detail & Related papers (2024-02-09T04:34:08Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or user-specific masks.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [67.66825818489406]
This paper introduces a text-to-waveform music generation model, underpinned by the utilization of diffusion models.
Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process.
We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
arXiv Detail & Related papers (2023-02-09T06:27:09Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)