The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis
- URL: http://arxiv.org/abs/2505.03337v2
- Date: Tue, 30 Sep 2025 09:14:34 GMT
- Title: The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis
- Authors: Bernardo Torres, Geoffroy Peeters, Gael Richard
- Abstract summary: The Inverse Drum Machine (IDM) is a novel approach to Drum Source Separation that leverages an analysis-by-synthesis framework combined with deep learning. IDM integrates Automatic Drum Transcription and One-shot Drum Sample Synthesis, jointly optimizing these tasks in an end-to-end manner. Experiments on the StemGMD dataset demonstrate that IDM achieves separation quality comparable to state-of-the-art supervised methods that require isolated stem data.
- Score: 11.79997201521754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Inverse Drum Machine (IDM), a novel approach to Drum Source Separation that leverages an analysis-by-synthesis framework combined with deep learning. Unlike recent supervised methods that require isolated stem recordings for training, our approach is trained on drum mixtures with only transcription annotations. IDM integrates Automatic Drum Transcription and One-shot Drum Sample Synthesis, jointly optimizing these tasks in an end-to-end manner. By convolving synthesized one-shot samples with estimated onsets, akin to a drum machine, we reconstruct the individual drum stems and train a Deep Neural Network to reconstruct the mixture. Experiments on the StemGMD dataset demonstrate that IDM achieves separation quality comparable to state-of-the-art supervised methods that require isolated stem data.
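The "drum machine" step in the abstract, placing a synthesized one-shot sample at each estimated onset, amounts to convolving the sample with a sparse impulse train. A minimal sketch of that idea (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def render_stem(one_shot, onset_times, gains, sr=44100, length_s=4.0):
    """Reconstruct one drum stem by convolving a one-shot sample with
    a sparse impulse train built from estimated onset times and gains."""
    n = int(length_s * sr)
    impulses = np.zeros(n)
    for t, g in zip(onset_times, gains):
        idx = int(t * sr)
        if idx < n:
            impulses[idx] = g           # one weighted spike per onset
    return np.convolve(impulses, one_shot)[:n]

# Toy one-shot: a windowed 60 Hz sine standing in for a kick sample.
kick = np.hanning(512) * np.sin(2 * np.pi * 60.0 * np.arange(512) / 44100)
stem = render_stem(kick, onset_times=[0.0, 0.5, 1.0], gains=[1.0, 0.8, 1.0])
```

In IDM the per-class stems reconstructed this way are summed, and the network is trained against the reconstruction of the mixture.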
Related papers
- High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z) - Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z) - Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers [49.1574468325115]
We introduce a unified convergence analysis framework for deterministic samplers.
Our framework achieves iteration complexity of $\tilde{O}(d^2/\epsilon)$.
We also provide a detailed analysis of Denoising Diffusion Implicit Model (DDIM)-type samplers.
arXiv Detail & Related papers (2024-10-18T07:37:36Z) - Subtractive Training for Music Stem Insertion using Latent Diffusion Models [35.91945598575059]
We present Subtractive Training, a method for synthesizing individual musical instrument stems given other instruments as context. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
arXiv Detail & Related papers (2024-06-27T16:59:14Z) - Toward Deep Drum Source Separation [52.01259769265708]
We introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems.
Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date.
We leverage StemGMD to develop LarsNet, a novel deep drum source separation model.
arXiv Detail & Related papers (2023-12-15T10:23:07Z) - PROM: A Phrase-level Copying Mechanism with Pre-training for Abstractive
Summarization [139.242907155883]
This work proposes PROM, a new PhRase-level cOpying Mechanism that enhances attention on n-grams.
PROM adds an indicator layer to explicitly pick up tokens in n-gram that can be copied from the source, and calculates an auxiliary loss for the copying prediction.
In zero-shot setting, PROM is utilized in the self-supervised pre-training on raw corpora and provides new general baselines on a wide range of summarization datasets.
arXiv Detail & Related papers (2023-05-11T08:29:05Z) - Neural Machine Translation with Contrastive Translation Memories [71.86990102704311]
Retrieval-augmented Neural Machine Translation models have been successful in many translation scenarios.
We propose a new retrieval-augmented NMT to model contrastively retrieved translation memories that are holistically similar to the source sentence.
In the training phase, a multi-TM contrastive learning objective is introduced to learn the salient features of each TM with respect to the target sentence.
arXiv Detail & Related papers (2022-12-06T17:10:17Z) - Conditional Drums Generation using Compound Word Representations [4.435094091999926]
We tackle the task of conditional drums generation using a novel data encoding scheme inspired by Compound Word representation.
We present a sequence-to-sequence architecture where a bidirectional long short-term memory (BiLSTM) network receives information about the conditioning parameters.
A Transformer-based Decoder with relative global attention produces the generated drum sequences.
arXiv Detail & Related papers (2022-02-09T13:49:27Z) - Differentiable Digital Signal Processing Mixture Model for Synthesis
Parameter Extraction from Mixture of Harmonic Sounds [29.012177604120048]
A differentiable digital signal processing (DDSP) autoencoder is a musical sound synthesizer that combines a deep neural network (DNN) with spectral modeling synthesis.
It allows us to flexibly edit sounds by changing the fundamental frequency, timbre feature, and loudness (synthesis parameters) extracted from an input sound.
It is designed for a monophonic harmonic sound and cannot handle mixtures of harmonic sounds.
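The synthesis parameters mentioned above (fundamental frequency, timbre, loudness) drive an additive harmonic model. A minimal sketch with a constant fundamental and hand-picked harmonic amplitudes; the actual DDSP model predicts time-varying controls with a DNN, so names and simplifications here are illustrative only:

```python
import numpy as np

def harmonic_synth(f0, harmonic_amps, loudness, sr=16000):
    """Additive synthesis: sum sinusoids at integer multiples of f0,
    weighted by harmonic amplitudes (timbre) and scaled by loudness."""
    n = len(loudness)
    t = np.arange(n) / sr
    out = np.zeros(n)
    for k, a in enumerate(harmonic_amps, start=1):
        if k * f0 < sr / 2:             # keep partials below Nyquist
            out += a * np.sin(2 * np.pi * k * f0 * t)
    return loudness * out

# One second of a 220 Hz tone with three harmonics at constant loudness.
audio = harmonic_synth(220.0, harmonic_amps=[1.0, 0.5, 0.25],
                       loudness=np.ones(16000))
```

Editing `f0`, `harmonic_amps`, or `loudness` re-renders the sound, which is the flexibility the abstract refers to.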
arXiv Detail & Related papers (2022-02-01T03:38:49Z) - Reference-based Magnetic Resonance Image Reconstruction Using Texture
Transformer [86.6394254676369]
We propose a novel Texture Transformer Module (TTM) for accelerated MRI reconstruction.
We formulate the under-sampled data and reference data as queries and keys in a transformer.
The proposed TTM can be stacked on prior MRI reconstruction approaches to further improve their performance.
arXiv Detail & Related papers (2021-11-18T03:06:25Z) - Source Separation-based Data Augmentation for Improved Joint Beat and
Downbeat Tracking [33.05612957858605]
We propose to use a blind drum separation model to segregate the drum and non-drum sounds from each training audio signal.
We report experiments on four completely unseen test sets, validating the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-06-16T11:09:05Z) - Global Structure-Aware Drum Transcription Based on Self-Attention
Mechanisms [18.5148472561169]
This paper describes an automatic drum transcription (ADT) method that directly estimates a tatum-level drum score from a music signal.
To capture the global repetitive structure of drum scores, we introduce a self-attention mechanism with tatum-synchronous positional encoding into the decoder.
Experimental results showed that the proposed regularized model outperformed the conventional RNN-based model in terms of the tatum-level error rate and the frame-level F-measure.
arXiv Detail & Related papers (2021-05-12T17:04:16Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
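The oracle principle with an ideal ratio mask can be sketched briefly: each source's share of the total magnitude in every time-frequency bin is used as a mask on the mixture, giving an upper bound on achievable separation. Shapes and names below are illustrative assumptions:

```python
import numpy as np

def ideal_ratio_masks(source_mags, eps=1e-8):
    """Oracle ideal ratio masks from true source magnitude spectrograms,
    shaped (n_sources, freq_bins, frames). Masks sum to ~1 per bin."""
    total = source_mags.sum(axis=0, keepdims=True) + eps
    return source_mags / total

# Toy magnitudes for 3 sources; applying each mask to the mixture
# magnitude yields the oracle-separated estimates.
rng = np.random.default_rng(0)
mags = rng.random((3, 129, 100))
masks = ideal_ratio_masks(mags)
oracle_estimates = masks * mags.sum(axis=0)
```

Because the masks are computed from the ground-truth sources, the resulting scores proxy the ceiling that learned separators on that dataset can approach.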
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Multitask learning for instrument activation aware music source
separation [83.30944624666839]
We propose a novel multitask structure to investigate using instrument activation information to improve source separation performance.
We investigate our system on six independent instruments, a more realistic scenario than the three instruments included in the widely-used MUSDB dataset.
The results show that our proposed multitask model outperforms the baseline Open-Unmix model on the mixture of Mixing Secrets and MedleyDB dataset.
arXiv Detail & Related papers (2020-08-03T02:35:00Z) - wav2shape: Hearing the Shape of a Drum Machine [4.283530753133897]
Disentangling and recovering physical attributes from a waveform is a challenging inverse problem in audio signal processing.
We propose to address this problem via a combination of time--frequency analysis and supervised machine learning.
arXiv Detail & Related papers (2020-07-20T17:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.