Towards Improving Harmonic Sensitivity and Prediction Stability for
Singing Melody Extraction
- URL: http://arxiv.org/abs/2308.02723v1
- Date: Fri, 4 Aug 2023 21:59:40 GMT
- Title: Towards Improving Harmonic Sensitivity and Prediction Stability for
Singing Melody Extraction
- Authors: Keren Shao, Ke Chen, Taylor Berg-Kirkpatrick, Shlomo Dubnov
- Abstract summary: We propose an input feature modification and a training objective modification based on two assumptions.
To enhance the model's sensitivity to the trailing harmonics, we modify the Combined Frequency and Periodicity representation using the discrete z-transform.
We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network.
- Score: 36.45127093978295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning research, many melody extraction models rely on redesigning
neural network architectures to improve performance. In this paper, we propose
an input feature modification and a training objective modification based on
two assumptions. First, harmonics in the spectrograms of audio data decay
rapidly along the frequency axis. To enhance the model's sensitivity to the
trailing harmonics, we modify the Combined Frequency and Periodicity (CFP)
representation using the discrete z-transform. Second, the vocal and non-vocal
segments with extremely short duration are uncommon. To ensure a more stable
melody contour, we design a differentiable loss function that prevents the
model from predicting such segments. We apply these modifications to several
models, including MSNet, FTANet, and a newly introduced model, PianoNet,
modified from a piano transcription network. Our experimental results
demonstrate that the proposed modifications are empirically effective for
singing melody extraction.
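The second idea, a differentiable loss that discourages implausibly short vocal/non-vocal segments, can be illustrated with a simple smoothness proxy. The sketch below is not the paper's actual objective (which this summary does not specify); the function name `stability_loss`, the window size `k`, and the local-average formulation are illustrative assumptions. Rapid voicing flips deviate strongly from their local average, while a long stable segment only pays a small cost at its boundaries.

```python
def stability_loss(p, k=5):
    """Mean absolute deviation of each frame's voicing probability from
    its k-frame local average.

    p : list of frame-wise voicing probabilities in [0, 1]
    k : smoothing window length in frames (assumed odd here)

    Every operation is differentiable almost everywhere, so the same idea
    ports to an autograd framework (e.g. PyTorch) as a training objective.
    """
    half = k // 2
    n = len(p)
    loss = 0.0
    for t in range(n):
        # Local window, truncated at the sequence boundaries.
        window = p[max(0, t - half): t + half + 1]
        local_avg = sum(window) / len(window)
        loss += abs(p[t] - local_avg)
    return loss / n

# Rapidly alternating voiced/unvoiced frames incur a larger penalty than
# a single long voiced segment covering the same number of frames.
choppy = [0.0, 1.0] * 10            # flips every frame
stable = [0.0] * 10 + [1.0] * 10    # one clean transition
assert stability_loss(choppy) > stability_loss(stable)
```

In a real training setup this term would be added, with some weight, to the frame-wise melody prediction loss, so the model is nudged away from contours that flicker between voiced and unvoiced over a handful of frames.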
Related papers
- Sine, Transient, Noise Neural Modeling of Piano Notes [0.0]
Three sub-modules learn components from piano recordings and generate harmonic, transient, and noise signals.
From single notes, we emulate the coupling between different keys in trichords with a convolution-based network.
Results show the model matches the partial distribution of the target, while predicting the energy in the higher part of the spectrum remains more challenging.
arXiv Detail & Related papers (2024-09-10T13:48:18Z) - Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models [7.928003786376716]
We propose novel architectures for convolutional recurrent neural networks.
We improve note-state sequence modeling by using a pitchwise LSTM.
We show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset.
arXiv Detail & Related papers (2024-04-10T08:06:15Z) - Emotion-Conditioned Melody Harmonization with Hierarchical Variational
Autoencoder [11.635877697635449]
We propose a novel LSTM-based Hierarchical Variational Auto-Encoder (LHVAE) to investigate the influence of emotional conditions on melody harmonization.
LHVAE incorporates latent variables and emotional conditions at different levels to model the global and local music properties.
Objective experimental results show that our proposed model outperforms other LSTM-based models.
arXiv Detail & Related papers (2023-06-06T14:28:57Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with
Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic
Music [43.17623332544677]
TONet is a plug-and-play model that improves both tone and octave perceptions.
We present an improved input representation, the Tone-CFP, that explicitly groups harmonics.
We also propose a tone-octave fusion mechanism to improve the final salience feature map.
arXiv Detail & Related papers (2022-02-02T10:55:48Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - Anomaly Detection of Time Series with Smoothness-Inducing Sequential
Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z) - Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into MPA models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.