Singing voice synthesis based on frame-level sequence-to-sequence models
considering vocal timing deviation
- URL: http://arxiv.org/abs/2301.02262v1
- Date: Thu, 5 Jan 2023 19:00:10 GMT
- Title: Singing voice synthesis based on frame-level sequence-to-sequence models
considering vocal timing deviation
- Authors: Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, and
Keiichi Tokuda
- Abstract summary: This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation.
In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account differences between actual vocal timing and note start timing.
- Score: 15.185681242504467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes singing voice synthesis (SVS) based on frame-level
sequence-to-sequence models considering vocal timing deviation. In SVS, it is
essential to synchronize the timing of singing with temporal structures
represented by scores, taking into account the differences between actual
vocal timing and note start timing. In many SVS systems, including our
previous work, phoneme-level score features are converted into frame-level ones
on the basis of phoneme boundaries obtained by external aligners to take into
account vocal timing deviations. The sound quality of such systems therefore
depends on the aligner's accuracy. To alleviate this problem, we introduce an
attention mechanism with frame-level features. In the proposed system, the
attention mechanism absorbs alignment errors in phoneme boundaries.
Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by
heuristic rules based on musical scores when there is no aligner. The
experimental results show the effectiveness of the proposed system.
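The conversion described above, from phoneme-level score features to frame-level ones using phoneme boundaries, together with a pseudo-phoneme-boundary heuristic usable when no aligner is available, can be sketched as follows. The frame shift, the even-split heuristic, and all function and variable names are illustrative assumptions for this sketch, not the paper's actual configuration.

```python
# Minimal sketch: expand phoneme-level score features to frame-level
# features using phoneme boundaries, and derive pseudo phoneme
# boundaries from note timings when no external aligner is available.
# The 5 ms frame shift and the even-split rule are assumptions.

FRAME_SHIFT = 0.005  # seconds per frame (assumed 5 ms hop)

def phoneme_to_frame_features(features, boundaries):
    """Repeat each phoneme's feature vector over the frames it spans.

    features   -- list of per-phoneme feature vectors
    boundaries -- list of (start_sec, end_sec) per phoneme
    """
    frame_features = []
    for feat, (start, end) in zip(features, boundaries):
        # Number of frames covered by this phoneme (at least one).
        n_frames = max(1, round((end - start) / FRAME_SHIFT))
        frame_features.extend([feat] * n_frames)
    return frame_features

def pseudo_boundaries(note_start, note_end, n_phonemes):
    """Heuristic pseudo phoneme boundaries from a note's timing:
    split the note duration evenly among its phonemes."""
    step = (note_end - note_start) / n_phonemes
    return [(note_start + i * step, note_start + (i + 1) * step)
            for i in range(n_phonemes)]
```

For example, a 100 ms note carrying two phonemes yields two 50 ms pseudo segments, each then repeated over its frames. Since such even splits ignore actual vocal timing, a frame-level attention mechanism as proposed here would be expected to absorb the resulting boundary errors.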
Related papers
- AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion [2.3443118032034396]
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing.
Our model outperforms existing state-of-the-art results in both subjective and objective evaluations.
arXiv Detail & Related papers (2023-10-10T11:50:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis [67.96138567288197]
This paper proposes an end-to-end singing voice synthesis framework, named Singing-Tacotron.
The main difference from Tacotron is that the synthesized voice can be controlled by the duration information in the musical score.
arXiv Detail & Related papers (2022-02-16T07:35:17Z)
- Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
- Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System [25.573552964889963]
This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system.
The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder.
Experimental results show our system can synthesize a singing voice with better timing, more natural vibrato, and correct pitch.
arXiv Detail & Related papers (2021-08-05T17:59:58Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement-learning-based framework to train an agent that decides whether to wait for more input text or to begin synthesizing speech.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.