Stemphonic: All-at-once Flexible Multi-stem Music Generation
- URL: http://arxiv.org/abs/2602.09891v1
- Date: Tue, 10 Feb 2026 15:30:12 GMT
- Title: Stemphonic: All-at-once Flexible Multi-stem Music Generation
- Authors: Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres, Cheng-Zhi Anna Huang, Nicholas J. Bryan
- Abstract summary: Music stem generation offers greater user control and better alignment with musician workflows. We propose Stemphonic, a diffusion-/flow-based framework that generates a variable set of synchronized stems in one inference pass. We show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%.
- Score: 15.126857537352182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
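The abstract's core mechanism, one shared initial noise latent across a batch of stem-specific generations, can be illustrated with a short sketch. The following is a minimal, hypothetical Python/PyTorch sketch assuming a generic flow-matching denoiser `model(x, t, text_emb)`, a text encoder `encode_text`, and a plain Euler sampler; these names and interfaces are illustrative assumptions, not the authors' actual API.

```python
# Minimal sketch of Stemphonic-style one-pass multi-stem sampling, assuming a
# generic flow-/diffusion-style denoiser `model(x, t, text_emb)` and a text
# encoder `encode_text`. All names are illustrative, not the authors' API.
import torch

@torch.no_grad()
def generate_stems(model, encode_text, stem_prompts, latent_shape,
                   num_steps=50, device="cpu"):
    """Generate a variable number of synchronized stems in one pass.

    stem_prompts: list of stem-specific text prompts, e.g.
        ["punchy drum kit", "warm electric bass", "ambient pad"].
    Every stem is a batch element; all stems share ONE initial noise
    latent so the denoising trajectories stay musically synchronized.
    """
    n_stems = len(stem_prompts)
    # Shared initial noise: sample once, repeat across the stem batch.
    shared_noise = torch.randn(1, *latent_shape, device=device)
    x = shared_noise.repeat(n_stems, 1, 1)          # (n_stems, C, T)
    text_emb = encode_text(stem_prompts)            # stem-specific conditioning

    # Simple Euler integration of the probability-flow ODE from t=1 to t=0.
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(n_stems)
        v = model(x, t, text_emb)                   # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v             # one ODE step for all stems
    return x  # decode each latent to audio, then sum stems for the full mix
```

Because every stem in the batch starts from the same noise and follows the same sampling schedule, the outputs stay temporally aligned and can simply be summed into a full mix after decoding.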
Related papers
- MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning [18.636738208526676]
MM-Sonate is a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. To enable zero-shot voice cloning, we introduce a classifier injection mechanism that effectively decouples speaker identity from linguistic content. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks.
arXiv Detail & Related papers (2026-01-04T15:26:15Z)
- Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures [12.393086516044866]
In this work, we explore singing voice separation from real music recordings using a diffusion model. We present a study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
arXiv Detail & Related papers (2025-11-26T12:49:35Z)
- Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model [12.393086516044866]
In this work, we study the potential of diffusion models to advance toward bridging this gap. We focus on generative singing voice separation relying on corresponding pairs of isolated vocals and mixtures for training. To align with creative mixtures, we leverage latent diffusion: the system generates samples encoded in a compact latent space, and subsequently decodes these into audio.
arXiv Detail & Related papers (2025-11-25T16:34:07Z)
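To make the encode-generate-decode pipeline described in the entry above concrete, here is a minimal, hypothetical PyTorch sketch of mixture-conditioned latent diffusion sampling; the `codec` and `model` interfaces are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of latent-diffusion vocal separation, assuming a pretrained
# audio codec with `encode`/`decode` and a denoiser `model(x, t, cond)`.
# All names are illustrative; they are not the paper's actual API.
import torch

@torch.no_grad()
def separate_vocals(model, codec, mixture, num_steps=50):
    """Generate the isolated vocal track for a music mixture.

    The mixture is encoded into a compact latent; the diffusion model then
    synthesizes the vocal latent from noise, conditioned on the mixture
    latent, and the codec decodes the result back to audio.
    """
    mix_latent = codec.encode(mixture)              # (B, C, T') compact latent
    x = torch.randn_like(mix_latent)                # start from pure noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t = ts[i].expand(x.shape[0])
        v = model(x, t, cond=mix_latent)            # mixture-conditioned step
        x = x + (ts[i + 1] - ts[i]) * v
    return codec.decode(x)                          # vocals as waveform audio
```

Operating in the compact latent space rather than on raw waveforms is what makes this family of methods comparatively efficient and fast.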
- High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z)
- Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z)
- SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation. It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z)
- Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
arXiv Detail & Related papers (2024-02-02T13:44:47Z)
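As a rough illustration of the invertible-latent idea in the entry above, the following hypothetical PyTorch sketch pairs a strided convolutional encoder with a transposed-convolution decoder trained for reconstruction; the architecture and dimensions are assumptions, not the authors' model.

```python
# Minimal sketch of a waveform autoencoder that compresses audio into an
# (approximately) invertible latent representation. Architecture details
# here are illustrative assumptions, not the authors' model.
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    def __init__(self, latent_dim=64, stride=1024):
        super().__init__()
        # Encoder: strided 1-D convolution maps waveform -> compact latent.
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=2 * stride,
                                 stride=stride, padding=stride // 2)
        # Decoder: transposed convolution maps latent -> waveform.
        self.decoder = nn.ConvTranspose1d(latent_dim, 1,
                                          kernel_size=2 * stride,
                                          stride=stride, padding=stride // 2)

    def encode(self, wav):          # wav: (B, 1, num_samples)
        return torch.tanh(self.encoder(wav))

    def decode(self, z):            # z: (B, latent_dim, num_frames)
        return self.decoder(z)

# Reconstruction objective: make the latent invertible in practice.
model = AudioAutoencoder()
wav = torch.randn(2, 1, 16384)      # dummy audio batch
loss = nn.functional.l1_loss(model.decode(model.encode(wav)), wav)
```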
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Modeling the Compatibility of Stem Tracks to Generate Music Mashups [6.922825755771942]
A music mashup combines audio elements from two or more songs to create a new work.
Research has developed algorithms that predict the compatibility of audio elements.
arXiv Detail & Related papers (2021-03-26T01:51:11Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.