Improving Perceptual Quality of Drum Transcription with the Expanded
Groove MIDI Dataset
- URL: http://arxiv.org/abs/2004.00188v5
- Date: Tue, 1 Dec 2020 18:11:04 GMT
- Title: Improving Perceptual Quality of Drum Transcription with the Expanded
Groove MIDI Dataset
- Authors: Lee Callender, Curtis Hawthorne, Jesse Engel
- Abstract summary: Expanded Groove MIDI dataset (E-GMD) contains 444 hours of audio from 43 drum kits.
We use E-GMD to optimize classifiers for use in downstream generation by predicting expressive dynamics (velocity) and show with listening tests that they produce outputs with improved perceptual quality, despite similar results on classification metrics.
- Score: 2.3204178451683264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the Expanded Groove MIDI dataset (E-GMD), an automatic drum
transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits,
making it an order of magnitude larger than similar datasets, and the first
with human-performed velocity annotations. We use E-GMD to optimize classifiers
for use in downstream generation by predicting expressive dynamics (velocity)
and show with listening tests that they produce outputs with improved
perceptual quality, despite similar results on classification metrics. Via the
listening tests, we argue that standard classifier metrics, such as accuracy
and F-measure score, are insufficient proxies of performance in downstream
tasks because they do not fully align with the perceptual quality of generated
outputs.
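To make the metric critique concrete, below is a minimal sketch of the onset-level F-measure conventionally used in ADT evaluation (the +/-50 ms tolerance and greedy matching are common simplifications, not necessarily the paper's exact protocol); two transcriptions with similar scores under such a metric can still differ audibly once expressive velocities are rendered.

    # Minimal sketch: onset-level F-measure for drum transcription.
    # Greedy matching is a simplification; evaluation toolkits such as
    # mir_eval use optimal bipartite matching instead.
    def onset_f_measure(ref_onsets, est_onsets, tolerance=0.05):
        ref = sorted(ref_onsets)   # reference onset times in seconds
        est = sorted(est_onsets)   # estimated onset times in seconds
        used = [False] * len(ref)
        matched = 0
        for t in est:
            for i, r in enumerate(ref):
                if not used[i] and abs(t - r) <= tolerance:
                    used[i] = True
                    matched += 1
                    break
        precision = matched / len(est) if est else 0.0
        recall = matched / len(ref) if ref else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(onset_f_measure([0.50, 1.00, 1.52], [0.48, 1.00, 1.60]))  # ~0.667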
Related papers
- Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering [0.0]
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments.
Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations.
We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212 across the trained models.
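As a hedged illustration of the described pipeline (the exact feature settings and pooling are not specified above, so those below are assumptions), segmentation and feature extraction might look like:

    # Sketch of the described pipeline: 30-second windows, then MFCC,
    # chroma, and spectral contrast features via librosa. The feature
    # dimensions and mean/std pooling are assumptions for illustration.
    import numpy as np
    import librosa

    def extract_features(path, window_s=30.0, sr=22050):
        y, sr = librosa.load(path, sr=sr)
        hop = int(window_s * sr)
        feats = []
        for start in range(0, max(len(y) - hop + 1, 1), hop):
            seg = y[start:start + hop]
            mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
            chroma = librosa.feature.chroma_stft(y=seg, sr=sr)
            contrast = librosa.feature.spectral_contrast(y=seg, sr=sr)
            stacked = np.vstack([mfcc, chroma, contrast])  # (32, frames)
            feats.append(np.concatenate([stacked.mean(axis=1),
                                         stacked.std(axis=1)]))
        return np.array(feats)  # one row per 30-second window

Each window's feature vector would then feed a regressor (e.g., ridge regression) trained against the 0-100 sentiment targets.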
arXiv Detail & Related papers (2024-10-31T20:26:26Z)
- Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models [2.3749120526936465]
We propose and investigate the use of neural audio language models for the automatic generation of sample-based musical instruments.
Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding.
arXiv Detail & Related papers (2024-07-22T13:59:58Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.
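The summary does not spell out the augmentation procedure, but as a generic sketch of semantic augmentation in embedding space (all parameters below are illustrative assumptions, not DASA's method), one can perturb speaker embeddings along class-conditional covariance directions so that new samples stay on plausible intra-speaker variations:

    # Generic embedding-space augmentation sketch (not the paper's exact
    # method): perturb an embedding with class-conditional Gaussian noise.
    import numpy as np

    def semantic_augment(emb, class_cov, strength=0.5, n_aug=4, rng=None):
        rng = rng or np.random.default_rng()
        noise = rng.multivariate_normal(mean=np.zeros(emb.shape[-1]),
                                        cov=strength * class_cov,
                                        size=n_aug)
        return emb[None, :] + noise  # (n_aug, dim) augmented embeddings

    # A difficulty-aware variant could scale `strength` per sample; because
    # no new audio is synthesized, the extra computing cost stays negligible.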
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality, but they often struggle when irrelevant references are mixed into the knowledge.
We present DKGen, which divides text generation into an iterative process.
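A schematic of such an iterative select-then-generate loop (all helper callables below are hypothetical placeholders, not DKGen's API) might look like:

    # Schematic select-then-generate loop: at each step, re-rank the
    # references against the evolving context and condition generation only
    # on the top-k, to keep irrelevant references out of the mix.
    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def iterative_generate(query_vec, refs, ref_vecs,
                           generate_sentence, encode, steps=3, k=2):
        context, ctx_vec = [], query_vec
        for _ in range(steps):
            order = np.argsort([-cosine(ctx_vec, v) for v in ref_vecs])
            selected = [refs[i] for i in order[:k]]      # dynamic selection
            sent = generate_sentence(context, selected)  # hypothetical LM call
            context.append(sent)
            ctx_vec = encode(" ".join(context))          # hypothetical encoder
        return " ".join(context)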
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
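A structural sketch of such a two-pass model in PyTorch (layer sizes and depths are placeholders; this is not the authors' implementation):

    # Structural sketch of a two-pass S2ST model: a first-pass decoder emits
    # text, a second encoder re-reads its hidden states, and a second-pass
    # decoder predicts discrete acoustic units.
    import torch
    import torch.nn as nn

    def _enc(d): return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
    def _dec(d): return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True), 2)

    class TwoPassS2ST(nn.Module):
        def __init__(self, d=256, text_vocab=1000, unit_vocab=1000):
            super().__init__()
            self.speech_enc = _enc(d)
            self.text_emb, self.text_dec = nn.Embedding(text_vocab, d), _dec(d)
            self.text_out = nn.Linear(d, text_vocab)
            self.t2u_enc = _enc(d)          # bridges pass 1 to pass 2
            self.unit_emb, self.unit_dec = nn.Embedding(unit_vocab, d), _dec(d)
            self.unit_out = nn.Linear(d, unit_vocab)

        def forward(self, speech, text_tokens, unit_tokens):
            mem = self.speech_enc(speech)                            # (B, S, d)
            txt_h = self.text_dec(self.text_emb(text_tokens), mem)   # pass 1
            unit_h = self.unit_dec(self.unit_emb(unit_tokens),
                                   self.t2u_enc(txt_h))              # pass 2
            return self.text_out(txt_h), self.unit_out(unit_h)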
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
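A minimal front-end sketch of that recipe (the searched light-DARTS classifier is replaced here by a plain linear probe for illustration, and the checkpoint ID is a common Hugging Face model assumed rather than taken from the paper):

    # Hedged sketch: a pre-trained wav2vec 2.0 encoder supplies high-level
    # speech representations; a small head then classifies real vs. fake.
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    head = torch.nn.Linear(encoder.config.hidden_size, 2)  # real vs. fake

    waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio
    inputs = extractor(waveform.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, frames, 768)
    logits = head(hidden.mean(dim=1))                  # pool over time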
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder into a mel-spectrogram with the help of the VQ-VAE; the vocoder then transforms the generated mel-spectrogram into a waveform.
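The wiring of that pipeline, with all component internals omitted and names as placeholders rather than the paper's code, is simply:

    # Schematic pipeline wiring: text encoder -> discrete-diffusion token
    # decoder -> VQ-VAE -> vocoder.
    def text_to_sound(text, text_encoder, token_decoder, vqvae, vocoder):
        text_feats = text_encoder(text)         # text -> feature sequence
        mel_tokens = token_decoder(text_feats)  # diffusion over VQ-VAE codes
        mel = vqvae.decode(mel_tokens)          # tokens -> mel-spectrogram
        return vocoder(mel)                     # mel-spectrogram -> waveform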
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning [42.95813372611093]
We propose to generate sounds conditioned on sound classes via neural discrete time-frequency representation learning.
This offers an advantage in modelling long-range dependencies and retaining local fine-grained structure within a sound clip.
arXiv Detail & Related papers (2021-07-21T10:31:28Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
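In outline, the switching mechanism works as below (a generic Markov-switching sketch, not the paper's exact generative model):

    # Generic Markov-switching decoder sketch: a discrete state s_t follows
    # a Markov chain with row-stochastic transition matrix `trans` and picks
    # which decoder (VAE architecture) handles each time step.
    import numpy as np

    def switching_decode(z_seq, decoders, trans, s0=0, rng=None):
        rng = rng or np.random.default_rng()
        s, out = s0, []
        for z in z_seq:
            s = rng.choice(len(decoders), p=trans[s])  # Markovian switch
            out.append(decoders[s](z))                 # chosen architecture
        return out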
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- The MIDI Degradation Toolkit: Symbolic Music Augmentation and Correction [14.972219905728963]
We introduce the MIDI Degradation Toolkit (MDTK), containing functions that take a musical excerpt as input and return a degraded version with errors introduced.
Using the toolkit, we create the Altered and Corrupted MIDI Excerpts dataset version 1.0.
We propose four tasks of increasing difficulty to detect, classify, locate, and correct the degradations.
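Illustrative degradations in the spirit of the toolkit (the actual MDTK API differs; these are toy functions over a simple note tuple):

    # Toy degradation functions over (onset_ms, pitch, duration_ms, velocity)
    # note tuples; each returns an altered copy of the excerpt.
    import random

    def pitch_shift(notes, rng=random):
        out = list(notes)
        i = rng.randrange(len(out))
        o, p, d, v = out[i]
        out[i] = (o, p + rng.choice([-2, -1, 1, 2]), d, v)
        return out

    def time_shift(notes, max_ms=100, rng=random):
        out = list(notes)
        i = rng.randrange(len(out))
        o, p, d, v = out[i]
        out[i] = (max(0, o + rng.randint(-max_ms, max_ms)), p, d, v)
        return out

    def remove_note(notes, rng=random):
        out = list(notes)
        out.pop(rng.randrange(len(out)))
        return out

The detect, classify, locate, and correct tasks then become supervised problems over (clean, degraded) pairs of excerpts.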
arXiv Detail & Related papers (2020-09-30T19:03:35Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.