Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
- URL: http://arxiv.org/abs/2010.12024v2
- Date: Fri, 26 Feb 2021 16:33:22 GMT
- Title: Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
- Authors: Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin
- Abstract summary: We propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network.
With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models.
- Score: 49.62291237343537
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting due to data scarcity. However, we often encounter data limitation problems in building SVS systems because of the high cost of data acquisition and annotation. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in both objective and subjective evaluations.
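As a rough illustration of how such a regularizer can enter the training objective, here is a minimal PyTorch sketch; the energy-based masking threshold and the `lambda_pe` weight below are simplifying assumptions, not the psycho-acoustic hearing model actually used in the paper.

```python
# Minimal sketch of a PE-style auxiliary loss (assumptions: a crude
# energy-based masking threshold stands in for the psycho-acoustic model;
# lambda_pe is a hypothetical weighting hyperparameter).
import torch
import torch.nn.functional as F

def perceptual_entropy(pred_mag: torch.Tensor,
                       target_mag: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """pred_mag, target_mag: (batch, frames, bins) linear spectrogram magnitudes."""
    # Stand-in masking threshold: a fraction of the per-frame target energy.
    threshold = 0.1 * target_mag.mean(dim=-1, keepdim=True) + eps
    # Bits needed to encode the residual relative to the threshold.
    residual = (pred_mag - target_mag).abs()
    return torch.log2(2.0 * residual / threshold.sqrt() + 1.0).mean()

def svs_loss(pred_mag, target_mag, lambda_pe: float = 0.1):
    # Primary reconstruction loss plus the perceptual-entropy regularizer.
    return F.l1_loss(pred_mag, target_mag) + lambda_pe * perceptual_entropy(pred_mag, target_mag)
```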
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Score-based Generative Priors Guided Model-driven Network for MRI Reconstruction [14.53268880380804]
We propose a novel workflow where naive SMLD samples serve as additional priors to guide model-driven network training.
First, we adopt a pretrained score network to generate samples as preliminary guidance images (PGIs).
Second, we design a denoising module (DM) to coarsely eliminate artifacts and noise from the PGIs.
Third, we design a model-driven network guided by the denoised PGIs to further recover fine details (a schematic sketch follows this entry).
arXiv Detail & Related papers (2024-05-05T14:56:34Z)
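A schematic sketch of the three-step workflow above, with hypothetical module designs standing in for the cited paper's score network, denoising module, and model-driven network:

```python
# Schematic sketch of the guided workflow; all module designs are hypothetical.
import torch
import torch.nn as nn

class DenoisingModule(nn.Module):
    """Step 2: coarsely remove artifacts/noise from preliminary guidance images."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))
    def forward(self, pgi):
        return pgi - self.net(pgi)  # predict the noise and subtract it

class GuidedReconNet(nn.Module):
    """Step 3: recover fine details from the input plus the denoised PGI."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))
    def forward(self, undersampled, denoised_pgi):
        return self.net(torch.cat([undersampled, denoised_pgi], dim=1))

# Step 1 (not shown): a pretrained score network samples PGIs via SMLD.
dm, recon = DenoisingModule(), GuidedReconNet()
pgi = torch.randn(1, 1, 64, 64)           # placeholder PGI
undersampled = torch.randn(1, 1, 64, 64)  # placeholder zero-filled input
output = recon(undersampled, dm(pgi))
```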
- Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks [15.700048595212051]
We introduce a self-modulating convolutional neural network which utilizes correlated spectral and spatial information.
At the core of the model lies a novel block that lets the network transform features adaptively based on the adjacent spectral data (sketched after this entry).
Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods.
arXiv Detail & Related papers (2023-09-15T06:57:43Z)
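A hedged sketch of the adaptive, spectrally conditioned feature transform described above, using a FiLM-style modulation as a stand-in for the paper's actual block design:

```python
# Sketch of a self-modulating conv block: features of the current band are
# scaled/shifted by statistics of adjacent spectral bands (FiLM-style stand-in;
# the exact block design is an assumption, not the published SM-CNN).
import torch
import torch.nn as nn

class SelfModulatingBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Produces per-channel scale and shift from neighboring-band features.
        self.modulator = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                       nn.Conv2d(channels, 2 * channels, 1))

    def forward(self, x: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # x, neighbors: (batch, channels, H, W); neighbors come from adjacent bands.
        scale, shift = self.modulator(neighbors).chunk(2, dim=1)
        return self.conv(x) * torch.sigmoid(scale) + shift

block = SelfModulatingBlock(16)
x, nbr = torch.randn(2, 16, 32, 32), torch.randn(2, 16, 32, 32)
y = block(x, nbr)  # (2, 16, 32, 32)
```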
- DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation [12.734839065028547]
This paper proposes DeepVQE, a real-time cross-attention deep model based on residual convolutional neural networks (CNNs) and recurrent neural networks (RNNs); a rough sketch follows this entry.
We conduct ablation studies to analyze the contributions of different components of our model to the overall performance.
DeepVQE achieves state-of-the-art performance on non-personalized tracks from the ICASSP 2023 Acoustic Echo Cancellation Challenge and ICASSP 2023 Deep Noise Suppression Challenge test sets, showing that a single model can handle multiple tasks with excellent performance.
arXiv Detail & Related papers (2023-06-05T18:37:05Z)
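A rough sketch of a cross-attention CNN/RNN enhancer in the spirit of the summary above; layer sizes and wiring are assumptions, not the published DeepVQE architecture:

```python
# Sketch of a cross-attention enhancer: microphone features attend to far-end
# (loudspeaker) features for echo cues, then an RNN models temporal context.
# All dimensions and the mask-based output are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionEnhancer(nn.Module):
    def __init__(self, bins: int = 257, hidden: int = 128):
        super().__init__()
        self.mic_enc = nn.Sequential(nn.Linear(bins, hidden), nn.ReLU())
        self.ref_enc = nn.Sequential(nn.Linear(bins, hidden), nn.ReLU())
        self.xattn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, bins), nn.Sigmoid())

    def forward(self, mic_mag, ref_mag):
        # mic_mag, ref_mag: (batch, frames, bins) magnitude spectrograms.
        q, kv = self.mic_enc(mic_mag), self.ref_enc(ref_mag)
        attn, _ = self.xattn(q, kv, kv)
        h, _ = self.rnn(q + attn)
        return mic_mag * self.mask(h)  # masked (enhanced) magnitudes
```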
- Deep learning for full-field ultrasonic characterization [7.120879473925905]
This study takes advantage of recent advances in machine learning to establish a physics-based data analytic platform.
Two approaches, namely direct inversion and physics-informed neural networks (PINNs), are explored (a generic PINN sketch follows this entry).
arXiv Detail & Related papers (2023-01-06T05:01:05Z)
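To make the PINN logic concrete, here is a generic sketch for a 1-D wave equation u_tt = c^2 u_xx; the governing equations, geometry, and network in the cited study differ:

```python
# Generic PINN physics-residual loss for u_tt = c^2 * u_xx (illustrative only).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
c = 1.0  # assumed wave speed

def pde_residual(xt: torch.Tensor) -> torch.Tensor:
    # xt: (N, 2) collocation points (x, t).
    xt = xt.requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0:1], grads[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    u_tt = torch.autograd.grad(u_t.sum(), xt, create_graph=True)[0][:, 1:2]
    return u_tt - (c ** 2) * u_xx

xt = torch.rand(256, 2)  # random collocation points
physics_loss = pde_residual(xt).pow(2).mean()  # add to data-fit losses
```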
- STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information for videos during feature extraction and state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF), sketched after this entry.
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
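The LTH-IF loop mentioned above can be sketched with PyTorch's pruning utilities; the number of rounds, pruning rate, and `train_fn` callback are illustrative assumptions:

```python
# Sketch of lottery-ticket-style iterative magnitude pruning with fine-tuning
# (LTH-IF in spirit; schedule and rates are assumptions).
import copy
import torch
import torch.nn.utils.prune as prune

def lth_iterative_prune(model: torch.nn.Module, train_fn, rounds: int = 5,
                        rate: float = 0.2) -> torch.nn.Module:
    init_state = copy.deepcopy(model.state_dict())  # candidate "winning ticket" init
    for _ in range(rounds):
        train_fn(model)  # fine-tune to convergence
        for module in model.modules():
            if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
                prune.l1_unstructured(module, name="weight", amount=rate)
        # Rewind surviving weights to their initial values (lottery ticket step).
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name.endswith("weight_orig"):
                    param.copy_(init_state[name[:-5]])  # "...weight_orig" -> "...weight"
    return model
```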
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score; a generic reverse step is sketched below.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
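A generic DDPM-style reverse step of the kind DiffSinger builds on; the noise schedule and the `model(x_t, t, score_cond)` signature are illustrative, with conditioning on the music score reduced to a placeholder argument:

```python
# Generic DDPM reverse step: the model predicts the added noise and one
# Markov-chain step denoises the mel-spectrogram (illustrative assumptions:
# betas is a 1-D schedule tensor; model is a noise-prediction network).
import torch

def reverse_step(model, x_t: torch.Tensor, t: int, betas: torch.Tensor,
                 score_cond: torch.Tensor) -> torch.Tensor:
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.prod(1.0 - betas[: t + 1])
    eps_hat = model(x_t, t, score_cond)  # predicted noise, score-conditioned
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)  # sample x_{t-1}
```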
- PRVNet: A Novel Partially-Regularized Variational Autoencoders for Massive MIMO CSI Feedback [15.972209500908642]
In a multiple-input multiple-output frequency-division duplexing (MIMO-FDD) system, the user equipment (UE) sends the downlink channel state information (CSI) to the base station to report link status.
In this paper, we introduce PRVNet, a neural network architecture inspired by variational autoencoders (VAE) to compress the CSI matrix before sending it back to the base station (a minimal VAE sketch follows this entry).
arXiv Detail & Related papers (2020-11-09T04:07:45Z)
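A minimal VAE-style CSI compressor illustrating the encode/feedback/decode idea; PRVNet's partial regularization and exact dimensions are not reproduced here:

```python
# Generic VAE sketch for CSI feedback: the UE encodes the CSI matrix to a
# short code, sends it back, and the base station decodes it (dimensions and
# the plain KL term are illustrative assumptions, not PRVNet itself).
import torch
import torch.nn as nn

class CSIVae(nn.Module):
    def __init__(self, csi_dim: int = 2048, code_dim: int = 32):
        super().__init__()
        self.enc = nn.Linear(csi_dim, 2 * code_dim)  # -> (mu, logvar)
        self.dec = nn.Linear(code_dim, csi_dim)

    def forward(self, csi: torch.Tensor):
        mu, logvar = self.enc(csi).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return recon, kl  # UE feeds back z (or mu); BS runs the decoder
```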
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks for which large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR); a weight-transfer sketch follows this entry.
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
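A small sketch of the warm-start idea above, assuming hypothetical checkpoints that store plain state dicts for an ASR encoder and a TTS decoder, and a `vc_model` exposing matching submodules:

```python
# Sketch of warm-starting a seq2seq VC model from pretrained ASR/TTS weights.
# Assumptions: vc_model exposes .encoder/.decoder, and the checkpoint files
# contain plain state dicts (paths and module names are hypothetical).
import torch

def warm_start(vc_model, asr_encoder_ckpt: str, tts_decoder_ckpt: str):
    enc_state = torch.load(asr_encoder_ckpt, map_location="cpu")
    dec_state = torch.load(tts_decoder_ckpt, map_location="cpu")
    # strict=False copies only parameters whose names and shapes match.
    vc_model.encoder.load_state_dict(enc_state, strict=False)
    vc_model.decoder.load_state_dict(dec_state, strict=False)
    return vc_model
```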
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.