Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural
Networks
- URL: http://arxiv.org/abs/2009.04172v1
- Date: Wed, 9 Sep 2020 09:11:49 GMT
- Authors: Helena Cuesta, Brian McFee, Emilia Gómez
- Abstract summary: This paper addresses the extraction of multiple F0 values from polyphonic and a cappella vocal performances using convolutional neural networks (CNNs).
We build upon an existing architecture to produce a pitch salience function of the input signal.
For training, we build a dataset that comprises several multi-track datasets of vocal quartets with F0 annotations.
- Score: 7.088324036549911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the extraction of multiple F0 values from polyphonic and
a cappella vocal performances using convolutional neural networks (CNNs). We
address the major challenges of ensemble singing, i.e., all melodic sources are
vocals and singers sing in harmony. We build upon an existing architecture to
produce a pitch salience function of the input signal, where the harmonic
constant-Q transform (HCQT) and its associated phase differentials are used as
an input representation. The pitch salience function is subsequently
thresholded to obtain a multiple F0 estimation output. For training, we build a
dataset that comprises several multi-track datasets of vocal quartets with F0
annotations. This work proposes and evaluates a set of CNNs for this task in
diverse scenarios and data configurations, including recordings with additional
reverb. Our models outperform a state-of-the-art method intended for the same
music genre when evaluated with an increased F0 resolution, as well as a
general-purpose method for multi-F0 estimation. We conclude with a discussion
on future research directions.
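The thresholding step described in the abstract can be sketched as follows. This is an illustrative simplification, not the paper's exact post-processing: the function name, the fixed threshold value, and the local-maximum rule are all assumptions.

```python
import numpy as np

def multi_f0_from_salience(salience, freqs, threshold=0.3):
    """Per time frame, keep every frequency bin whose salience exceeds a
    fixed threshold AND is a local maximum along the frequency axis.

    salience  -- (n_bins, n_frames) array of values in [0, 1]
    freqs     -- (n_bins,) frequency of each bin in Hz
    threshold -- salience cutoff (an illustrative value, not the paper's)
    """
    active = salience >= threshold
    # Local maximum along frequency: strictly above the bin below,
    # at least as high as the bin above. Edge bins are never peaks.
    peak = np.zeros_like(active)
    peak[1:-1] = (salience[1:-1] > salience[:-2]) & (salience[1:-1] >= salience[2:])
    mask = active & peak
    # One array of F0 candidates (in Hz) per frame
    return [freqs[mask[:, t]] for t in range(salience.shape[1])]
```

Each frame can yield zero, one, or several F0 candidates, which is what makes this a multiple-F0 (rather than single-pitch) output.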
Related papers
- Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
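A simple way to aggregate per-chunk predictions into one final label is a majority vote; this is an illustrative sketch, and the paper's exact aggregation rule may differ.

```python
from collections import Counter

def aggregate_chunk_predictions(chunk_labels):
    """Majority vote over per-chunk genre predictions.

    chunk_labels -- list of label strings, one per audio chunk
    """
    if not chunk_labels:
        raise ValueError("no chunk predictions to aggregate")
    # most_common(1) returns [(label, count)] for the winning label;
    # ties break by first-seen order.
    return Counter(chunk_labels).most_common(1)[0][0]
```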
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
- Musical Voice Separation as Link Prediction: Modeling a Musical Perception Task as a Multi-Trajectory Tracking Problem [6.617487928813374]
This paper targets the perceptual task of separating the different interacting voices, i.e., monophonic melodic streams, in a polyphonic musical piece.
We model this task as a Multi-Trajectory Tracking (MTT) problem from discrete observations, i.e. notes in a pitch-time space.
Our approach builds a graph from a musical piece, by creating one node for every note, and separates the melodic trajectories by predicting a link between two notes if they are consecutive in the same voice/stream.
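Given predicted links, voices can be recovered by chaining consecutive notes. The sketch below assumes the links are already predicted (in the paper they come from a learned link predictor) and that each note has at most one successor per voice.

```python
def voices_from_links(notes, links):
    """Recover monophonic voices from predicted consecutive-note links.

    notes -- list of (onset, pitch) tuples, one graph node per note
    links -- set of (i, j) index pairs meaning note j follows note i
             in the same voice/stream
    """
    successor = dict(links)             # each note has at most one successor
    has_pred = {j for _, j in links}    # notes that continue an earlier voice
    voices = []
    for i in range(len(notes)):
        if i in has_pred:
            continue                    # not the start of a voice
        chain, cur = [i], i
        while cur in successor:         # follow the predicted links
            cur = successor[cur]
            chain.append(cur)
        voices.append([notes[k] for k in chain])
    return voices
```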
arXiv Detail & Related papers (2023-04-28T13:48:00Z)
- Extract fundamental frequency based on CNN combined with PYIN [5.837881923712393]
PYIN is applied to supplement the F0 extracted from the trained CNN model to combine the advantages of these two algorithms.
Four pieces played by two violins are used, and the performance of the models is evaluated according to the flatness of the extracted F0 curves.
arXiv Detail & Related papers (2022-08-17T15:34:54Z)
- Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features [51.924340387119415]
Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all systems.
arXiv Detail & Related papers (2022-08-02T02:46:16Z)
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies compared to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
- HarmoF0: Logarithmic Scale Dilated Convolution For Pitch Estimation [7.5089093564620155]
This paper introduces a multiple rates dilated causal convolution (MRDC-Conv) method to capture the harmonic structure in logarithmic scale spectrograms efficiently.
We propose HarmoF0, a fully convolutional network, to evaluate the MRDC-Conv and other dilated convolutions in pitch estimation.
The results show that this model outperforms DeepF0, yields state-of-the-art performance on three datasets, and simultaneously reduces the number of parameters by more than 90%.
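The motivation for log-scale dilations can be illustrated numerically: in a log-frequency (constant-Q) spectrogram, the k-th harmonic lies log2(k) octaves above the fundamental, so its bin offset is fixed regardless of F0. A minimal sketch (the bins-per-octave value is an assumption, not HarmoF0's configuration):

```python
import math

def harmonic_bin_offsets(bins_per_octave=60, n_harmonics=6):
    """Bin offsets of the first n harmonics relative to the fundamental
    in a log-frequency spectrogram: harmonic k sits log2(k) octaves
    above F0, i.e. round(log2(k) * bins_per_octave) bins higher."""
    return [round(math.log2(k) * bins_per_octave)
            for k in range(1, n_harmonics + 1)]
```

Because these offsets are constant across pitch, dilated convolutions with dilation rates matching them can cover the harmonic structure with few parameters.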
arXiv Detail & Related papers (2022-05-02T16:45:20Z)
- Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes [22.14133334414372]
This paper proposes a deep convolutional neural network for performing note-level instrument assignment.
Experiments on the MusicNet dataset using 7 instrument classes show that our approach is able to achieve an average F-score of 0.904.
arXiv Detail & Related papers (2021-07-28T19:48:09Z)
- DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals [11.939409227407769]
We propose a novel pitch estimation technique called DeepF0.
It leverages the available annotated data to directly learn from the raw audio in a data-driven manner.
arXiv Detail & Related papers (2021-02-11T23:11:22Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure from normality function of real samples computed in the spectral domain of Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into music performance assessment (MPA) models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z) - F0-consistent many-to-many non-parallel voice conversion via conditional
autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.