DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech
Signals
- URL: http://arxiv.org/abs/2102.06306v1
- Date: Thu, 11 Feb 2021 23:11:22 GMT
- Title: DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech
Signals
- Authors: Satwinder Singh, Ruili Wang, Yuanhang Qiu
- Abstract summary: We propose a novel pitch estimation technique called DeepF0.
It leverages the available annotated data to directly learn from the raw audio in a data-driven manner.
- Score: 11.939409227407769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel pitch estimation technique called DeepF0, which leverages
the available annotated data to directly learns from the raw audio in a
data-driven manner. F0 estimation is important in various speech processing and
music information retrieval applications. Existing deep learning models for
pitch estimations have relatively limited learning capabilities due to their
shallow receptive field. The proposed model addresses this issue by extending
the receptive field of a network by introducing the dilated convolutional
blocks into the network. The dilation factor increases the network receptive
field exponentially without increasing the parameters of the model
exponentially. To make the training process more efficient and faster, DeepF0
is augmented with residual blocks with residual connections. Our empirical
evaluation demonstrates that the proposed model outperforms the baselines in
terms of raw pitch accuracy and raw chroma accuracy even using 77.4% fewer
network parameters. We also show that our model can capture reasonably well
pitch estimation even under the various levels of accompaniment noise.
Related papers
- Bayesian Deep Learning for Remaining Useful Life Estimation via Stein
Variational Gradient Descent [14.784809634505903]
We show that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance.
We propose a method to enhance performance based on the uncertainty information provided by the Bayesian models.
arXiv Detail & Related papers (2024-02-02T02:21:06Z) - Parallel and Limited Data Voice Conversion Using Stochastic Variational
Deep Kernel Learning [2.5782420501870296]
This paper proposes a voice conversion method that works with limited data.
It is based on variational deep kernel learning (SVDKL)
It is possible to estimate non-smooth and more complex functions.
arXiv Detail & Related papers (2023-09-08T16:32:47Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Learning Summary Statistics for Bayesian Inference with Autoencoders [58.720142291102135]
We use the inner dimension of deep neural network based Autoencoders as summary statistics.
To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information that has been used to generate the training data.
arXiv Detail & Related papers (2022-01-28T12:00:31Z) - Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing, this is done in a completely label-free manner.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z) - SignalNet: A Low Resolution Sinusoid Decomposition and Estimation
Network [79.04274563889548]
We propose SignalNet, a neural network architecture that detects the number of sinusoids and estimates their parameters from quantized in-phase and quadrature samples.
We introduce a worst-case learning threshold for comparing the results of our network relative to the underlying data distributions.
In simulation, we find that our algorithm is always able to surpass the threshold for three-bit data but often cannot exceed the threshold for one-bit data.
arXiv Detail & Related papers (2021-06-10T04:21:20Z) - High-Fidelity and Low-Latency Universal Neural Vocoder based on
Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform
Modeling [38.828260316517536]
This paper presents a novel universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP)
Experiments demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or language on 300 speakers training data including clean and noisy/reverberant conditions.
arXiv Detail & Related papers (2021-05-20T16:02:45Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Ensemble Wrapper Subsampling for Deep Modulation Classification [70.91089216571035]
Subsampling of received wireless signals is important for relaxing hardware requirements as well as the computational cost of signal processing algorithms.
We propose a subsampling technique to facilitate the use of deep learning for automatic modulation classification in wireless communication systems.
arXiv Detail & Related papers (2020-05-10T06:11:13Z) - Single Channel Speech Enhancement Using Temporal Convolutional Recurrent
Neural Networks [23.88788382262305]
temporal convolutional recurrent network (TCRN) is an end-to-end model that directly map noisy waveform to clean waveform.
We show that our model is able to improve the performance of model, compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.