Enhancement of Pitch Controllability using Timbre-Preserving Pitch
Augmentation in FastPitch
- URL: http://arxiv.org/abs/2204.05753v1
- Date: Tue, 12 Apr 2022 12:48:06 GMT
- Title: Enhancement of Pitch Controllability using Timbre-Preserving Pitch
Augmentation in FastPitch
- Authors: Hanbin Bae, Young-Sun Joo
- Abstract summary: We propose two algorithms to improve the robustness of FastPitch.
First, we propose a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation.
The experimental results demonstrate that the proposed algorithms improve the pitch controllability of FastPitch.
- Score: 3.858078488714278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently developed pitch-controllable text-to-speech (TTS)
model FastPitch is conditioned on pitch contours. However, the quality of the
synthesized speech degraded considerably for pitch values that deviated
significantly from the average pitch; i.e. the ability to control pitch was
limited. To address this issue, we propose two algorithms to improve the
robustness of FastPitch. First, we propose a novel timbre-preserving
pitch-shifting algorithm for natural pitch augmentation. Pitch-shifted speech
samples sound more natural when using the proposed algorithm because the
speaker's vocal timbre is maintained. Moreover, we propose a training algorithm
that fits FastPitch to pitch-augmented speech datasets with different
pitch ranges for the same sentence. The experimental results demonstrate that
the proposed algorithms improve the pitch controllability of FastPitch.
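The pitch-augmentation idea above — producing several pitch-shifted versions of the same utterance for training — can be sketched as follows. This is an illustrative helper, not the paper's actual pipeline; the function name and semitone offsets are assumptions, and only the F0 targets (not the waveforms) are shifted here:

```python
import numpy as np

def augment_pitch_targets(f0, shifts_semitones=(-4, -2, 0, 2, 4)):
    """Build pitch-augmented F0 targets for one utterance.

    For each semitone shift, voiced frames (f0 > 0) are scaled
    multiplicatively by 2**(s/12); unvoiced frames stay at 0.
    Hypothetical helper for illustration only.
    """
    out = {}
    for s in shifts_semitones:
        factor = 2.0 ** (s / 12.0)
        out[s] = np.where(f0 > 0, f0 * factor, 0.0)
    return out
```

In the paper's setting, the corresponding waveforms would also be shifted with the timbre-preserving algorithm so that each shifted copy still sounds like the original speaker.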
Related papers
- PitchFlower: A flow-based neural audio codec with pitch controllability [8.972144370022841]
We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high-quality audio.
arXiv Detail & Related papers (2025-10-29T14:33:35Z) - Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters [7.865191493201841]
Control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. We propose a convolutional neural network-based approach that provides means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. The results suggest that the proposed method offers substantial flexibility while maintaining high intelligibility and speaker similarity.
arXiv Detail & Related papers (2025-07-07T09:36:00Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - PTP: Boosting Stability and Performance of Prompt Tuning with
Perturbation-Based Regularizer [94.23904400441957]
We introduce perturbation-based regularizers, which can smooth the loss landscape, into prompt tuning.
We design two kinds of perturbation-based regularizers, including random-noise-based and adversarial-based.
Our new algorithms improve the state-of-the-art prompt tuning methods by 1.94% and 2.34% on SuperGLUE and FewGLUE benchmarks, respectively.
arXiv Detail & Related papers (2023-05-03T20:30:51Z) - PITS: Variational Pitch Inference without Fundamental Frequency for
End-to-End Pitch-controllable TTS [1.5599422325061418]
PITS is an end-to-end pitch-controllable text-to-speech model.
PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch controllability.
arXiv Detail & Related papers (2023-02-24T01:43:17Z) - DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion [17.83563578034567]
We propose a new variational-autoencoder-based voice conversion model accompanied by an auxiliary network.
We show the effectiveness of the proposed method by objective and subjective evaluations.
arXiv Detail & Related papers (2022-10-20T07:30:07Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN's synthesis efficiency on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - Optimization of a Real-Time Wavelet-Based Algorithm for Improving Speech
Intelligibility [1.0554048699217666]
The discrete-time speech signal is split into frequency sub-bands via a multi-level discrete wavelet transform.
The sub-band gains are adjusted while keeping the overall signal energy unchanged.
The speech intelligibility under various background interference and simulated hearing loss conditions is enhanced.
arXiv Detail & Related papers (2022-02-05T13:03:57Z) - Unsupervised Classification of Voiced Speech and Pitch Tracking Using
Forward-Backward Kalman Filtering [14.950964357181524]
We present a new algorithm that integrates the three subtasks into a single procedure.
The algorithm can be applied to pre-recorded speech utterances in the presence of considerable amounts of background noise.
arXiv Detail & Related papers (2021-03-01T18:13:23Z) - Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z) - FastPitch: Parallel Text-to-speech with Pitch Prediction [9.213700601337388]
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech.
The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive.
arXiv Detail & Related papers (2020-06-11T23:23:58Z)
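The FastPitch blurb above notes that the generated speech can be made more expressive by altering the predicted pitch contour at inference. A minimal numpy sketch of that kind of alteration follows; the function name, Hz-valued per-token contour, and `scale` parameter are assumptions for illustration, not FastPitch's actual interface:

```python
import numpy as np

def alter_predicted_pitch(pitch_pred, semitone_shift=0.0, scale=1.0):
    """Modify a FastPitch-style per-token pitch prediction before decoding.

    pitch_pred: predicted pitch per input token, in Hz (0 = unvoiced).
    The shift is multiplicative (2**(s/12) per semitone); `scale`
    flattens (<1) or exaggerates (>1) the contour around its voiced mean.
    Hypothetical interface for illustration only.
    """
    voiced = pitch_pred > 0
    shifted = pitch_pred * (2.0 ** (semitone_shift / 12.0))
    mean = shifted[voiced].mean() if voiced.any() else 0.0
    contoured = mean + (shifted - mean) * scale
    return np.where(voiced, contoured, 0.0)
```

Large shifts produced this way are exactly where the baseline FastPitch degrades, which is the failure mode the pitch-augmentation work above targets.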
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.