Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription
- URL: http://arxiv.org/abs/2509.21739v1
- Date: Fri, 26 Sep 2025 01:12:43 GMT
- Title: Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription
- Authors: Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima
- Abstract summary: Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. Noise-to-Notes (N2N) is a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. N2N establishes new state-of-the-art performance across multiple ADT benchmarks.
- Score: 6.453619274330351
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models; to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that incorporating MFM features significantly improves robustness and that N2N establishes new state-of-the-art performance across multiple ADT benchmarks.
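The abstract names an Annealed Pseudo-Huber loss as the mechanism for jointly optimizing binary onsets and continuous velocities. Below is a minimal sketch of that idea, not the paper's exact formulation: the linear annealing schedule, the constants `c_start`/`c_end`, and the tensor shapes are assumptions. A large smoothness constant c makes the loss behave like a scaled L2 (smooth and stable early in training); annealing c toward zero sharpens it toward L1, which suits near-binary onset targets.

```python
import torch

def annealed_pseudo_huber_loss(pred: torch.Tensor,
                               target: torch.Tensor,
                               step: int,
                               total_steps: int,
                               c_start: float = 1.0,
                               c_end: float = 1e-3) -> torch.Tensor:
    """Pseudo-Huber loss sqrt((x - y)^2 + c^2) - c with c annealed over training.

    The linear schedule and the c_start/c_end values are assumptions,
    not taken from the paper.
    """
    t = min(step / max(total_steps, 1), 1.0)
    c = c_start + (c_end - c_start) * t  # anneal smooth (L2-like) -> sharp (L1-like)
    return (torch.sqrt((pred - target) ** 2 + c ** 2) - c).mean()

# Toy usage: a hypothetical (batch, frames, drum classes) tensor holding
# near-binary onsets and continuous velocities jointly.
pred = torch.rand(4, 256, 18)
target = torch.rand(4, 256, 18)
loss = annealed_pseudo_huber_loss(pred, target, step=1_000, total_steps=100_000)
```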
Related papers
- TLDiffGAN: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection [39.234515088121086]
We propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator's task more challenging and improving the quality of generated samples. We also introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked.
arXiv Detail & Related papers (2026-02-01T07:04:30Z)
- Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance [54.88271057438763]
Noise Awareness Guidance (NAG) is a correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.
arXiv Detail & Related papers (2025-10-14T13:31:34Z)
- FM-SIREN & FM-FINER: Nyquist-Informed Frequency Multiplier for Implicit Neural Representation with Periodic Activation [15.456760941404873]
We propose FM-SIREN and FM-FINER, which assign Nyquist-informed, neuron-specific frequency multipliers to periodic activations. This simple yet principled modification reduces the redundancy of features by nearly 50% and consistently improves signal reconstruction across diverse INR tasks.
arXiv Detail & Related papers (2025-09-27T18:14:47Z)
- Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing [92.61216319417208]
We propose a novel frequency domain-based diffusion model for fully exploiting the beneficial knowledge in unpaired clear data. Inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction.
arXiv Detail & Related papers (2025-07-02T01:22:46Z)
- Noise Conditional Variational Score Distillation [60.38982038894823]
Noise Conditional Variational Score Distillation (NCVSD) is a novel method for distilling pretrained diffusion models into generative denoisers. By integrating this insight into the Variational Score Distillation framework, we enable scalable learning of generative denoisers.
arXiv Detail & Related papers (2025-06-11T06:01:39Z)
- Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models [1.0579965347526206]
Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations. Noise-Augmented Fine-Tuning (NoiseFiT) is a novel framework that leverages adaptive noise injection to enhance model robustness. NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using dynamically scaled Gaussian noise.
arXiv Detail & Related papers (2025-04-04T09:27:19Z)
- DenoMAE: A Multimodal Autoencoder for Denoising Modulation Signals [21.25974800554959]
DenoMAE is a novel framework for denoising modulation signals during pretraining. It incorporates multiple input modalities, including noise, to enhance cross-modal learning, and achieves state-of-the-art accuracy in automatic modulation classification tasks.
arXiv Detail & Related papers (2025-01-20T15:23:16Z)
- DiffFNO: Diffusion Fourier Neural Operator [8.895165270489167]
We introduce DiffFNO, a novel diffusion framework for arbitrary-scale super-resolution strengthened by a Weighted Fourier Neural Operator (WFNO). Mode Rebalancing in WFNO effectively captures critical frequency components, significantly improving the reconstruction of high-frequency image details. Our approach sets a new standard in super-resolution, delivering both superior accuracy and computational efficiency.
arXiv Detail & Related papers (2024-11-15T03:14:11Z)
- Improved Noise Schedule for Diffusion Training [51.849746576387375]
We propose a novel approach to designing the noise schedule for enhancing the training of diffusion models. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule.
arXiv Detail & Related papers (2024-07-03T17:34:55Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have notable limitations: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations. We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly Detection [80.20339155618612]
DiffusionAD is a novel anomaly detection pipeline comprising a reconstruction sub-network and a segmentation sub-network. A rapid one-step denoising paradigm achieves hundreds of times acceleration while preserving comparable reconstruction quality. Considering the diversity in the manifestation of anomalies, we propose a norm-guided paradigm to integrate the benefits of multiple noise scales.
arXiv Detail & Related papers (2023-03-15T16:14:06Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. The shaping is performed in the time-frequency domain, keeping the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
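To illustrate the noise-shaping idea behind SpecGrad in its simplest form, here is a toy frequency-domain sketch: white Gaussian noise is given a target spectral envelope by filtering in the frequency domain. The fixed exponential envelope is only a stand-in; SpecGrad derives a time-varying filter from the conditioning log-mel spectrogram, which this sketch does not reproduce.

```python
import torch

# Toy sketch: impose a spectral envelope on white Gaussian noise by
# filtering in the frequency domain. SpecGrad's actual filter is
# time-varying and derived from the log-mel spectrogram; the fixed
# exponential low-pass envelope below is an illustrative stand-in.
n = 1024
noise = torch.randn(n)
freqs = torch.fft.rfftfreq(n)              # normalized frequencies in [0, 0.5]
envelope = torch.exp(-8.0 * freqs)         # assumed target envelope (stand-in)
shaped = torch.fft.irfft(torch.fft.rfft(noise) * envelope, n=n)
```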