A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
- URL: http://arxiv.org/abs/2406.18747v2
- Date: Mon, 26 Aug 2024 01:07:11 GMT
- Title: A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
- Authors: Karn N. Watcharasupat, Alexander Lerch
- Abstract summary: Banquet is a system that allows source separation of multiple stems using just one decoder.
A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model.
- Score: 53.30852012059025
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.
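To make the query-based setup concrete, below is a minimal sketch (not the authors' implementation) of query-conditioned separation with a single decoder: a shared encoder processes the mixture spectrogram, a query embedding (e.g., from an instrument-recognition model such as PaSST) modulates the bottleneck via a FiLM-style scale and shift, and one decoder predicts a mask for whichever stem the query selects. All module names and layer sizes are illustrative assumptions, and a plain MLP stands in for the bandsplit architecture used in the actual paper.

```python
# Minimal sketch of query-conditioned, single-decoder source separation.
# Hypothetical module names and sizes; not the Banquet implementation.
import torch
import torch.nn as nn


class QueryConditionedSeparator(nn.Module):
    def __init__(self, n_bins=1025, hidden=256, query_dim=768):
        super().__init__()
        # Shared encoder over magnitude-spectrogram frames.
        self.encoder = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # FiLM-style conditioning: the query embedding produces a per-channel
        # scale and shift applied to the encoded mixture.
        self.film = nn.Linear(query_dim, 2 * hidden)
        # A single decoder predicts a mask for whichever stem the query selects.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.Sigmoid(),
        )

    def forward(self, mix_mag, query_emb):
        # mix_mag: (batch, frames, n_bins); query_emb: (batch, query_dim)
        h = self.encoder(mix_mag)
        scale, shift = self.film(query_emb).chunk(2, dim=-1)
        h = h * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = self.decoder(h)
        return mix_mag * mask  # estimated magnitude of the queried stem


if __name__ == "__main__":
    model = QueryConditionedSeparator()
    mix = torch.rand(2, 100, 1025)   # dummy mixture spectrogram
    query = torch.randn(2, 768)      # dummy instrument-query embedding
    print(model(mix, query).shape)   # torch.Size([2, 100, 1025])
```

The point of this control flow is that adding a new target instrument only requires a new query embedding, not a new decoder, which is what keeps the parameter count fixed as stem support grows.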
Related papers
- An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation [0.4893345190925179]
Music source separation (MSS) is the task of isolating individual sound sources, or stems, from a mixed audio signal.
This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance (a minimal averaging sketch appears after this list).
arXiv Detail & Related papers (2024-10-28T06:18:12Z) - Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation [5.926447149127937]
Cinematic audio source separation (CASS) is a new subtask of audio source separation.
A typical CASS setup is a three-stem problem, with the aim of separating the mixture into the dialogue (DX), music (MX), and effects (FX) stems.
We demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem.
arXiv Detail & Related papers (2024-08-07T07:04:29Z) - DiffMoog: a Differentiable Modular Synthesizer for Sound Matching [48.33168531500444]
DiffMoog is a differentiable modular synthesizer with a comprehensive set of modules typically found in commercial instruments.
Being differentiable, it allows integration into neural networks, enabling automated sound matching.
We introduce an open-source platform that comprises DiffMoog and an end-to-end sound matching framework.
arXiv Detail & Related papers (2024-01-23T08:59:21Z) - Toward Deep Drum Source Separation [52.01259769265708]
We introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems.
Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date.
We leverage StemGMD to develop LarsNet, a novel deep drum source separation model.
arXiv Detail & Related papers (2023-12-15T10:23:07Z) - High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z) - Moisesdb: A dataset for source separation beyond 4-stems [0.9176056742068811]
This paper introduces the MoisesDB dataset for musical source separation.
It consists of 240 tracks from 45 artists, covering twelve musical genres.
For each song, we provide its individual audio sources, organized in a two-level hierarchical taxonomy of stems.
arXiv Detail & Related papers (2023-07-29T06:59:37Z) - End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a paradigm for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z) - Multitask learning for instrument activation aware music source separation [83.30944624666839]
We propose a novel multitask structure to investigate using instrument activation information to improve source separation performance.
We investigate our system on six independent instruments, a more realistic scenario than the three instruments included in the widely-used MUSDB dataset.
The results show that our proposed multitask model outperforms the baseline Open-Unmix model on a combined dataset of Mixing Secrets and MedleyDB.
arXiv Detail & Related papers (2020-08-03T02:35:00Z)
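As referenced in the ensemble-MSS entry above, the simplest form of stem ensembling blends the per-stem estimates produced by several separators. The sketch below uses plain waveform averaging; the function name, data layout, and uniform weighting are illustrative assumptions, not the specific method of that paper (which may use weighted or hierarchical blending).

```python
# Minimal sketch of ensembling per-stem waveform estimates across models.
# Hypothetical helper; real systems often use weighted or per-band blending.
import numpy as np


def ensemble_stems(estimates_per_model, weights=None):
    """Blend per-stem waveform estimates across models.

    estimates_per_model: list of dicts {stem_name: np.ndarray (channels, samples)}
    weights: optional per-model blend weights (defaults to uniform averaging).
    """
    if weights is None:
        weights = [1.0 / len(estimates_per_model)] * len(estimates_per_model)
    blended = {}
    for stem in estimates_per_model[0]:
        blended[stem] = sum(
            w * est[stem] for w, est in zip(weights, estimates_per_model)
        )
    return blended


if __name__ == "__main__":
    # Two dummy "models" producing estimates for two stems.
    rng = np.random.default_rng(0)
    est_a = {"vocals": rng.standard_normal((2, 44100)),
             "drums": rng.standard_normal((2, 44100))}
    est_b = {"vocals": rng.standard_normal((2, 44100)),
             "drums": rng.standard_normal((2, 44100))}
    out = ensemble_stems([est_a, est_b])
    print({k: v.shape for k, v in out.items()})
```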