Pitch Preservation In Singing Voice Synthesis
- URL: http://arxiv.org/abs/2110.05033v2
- Date: Tue, 12 Oct 2021 05:39:53 GMT
- Title: Pitch Preservation In Singing Voice Synthesis
- Authors: Shujun Liu, Hai Zhu, Kun Wang, Huajun Wang
- Abstract summary: This paper presents a novel acoustic model with independent pitch encoder and phoneme encoder, which disentangles the phoneme and pitch information from music score to fully utilize the corpus.
Experimental results indicate that the proposed approaches characterize the intrinsic structure among pitch inputs, obtain better pitch synthesis accuracy, and achieve superior singing synthesis performance compared with an advanced baseline system.
- Score: 6.99674326582747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because singing voice corpora are limited, existing singing voice
synthesis (SVS) methods that build encoder-decoder neural networks to directly
generate spectrograms can produce out-of-tune results during inference. To
mitigate these issues, this paper presents a novel acoustic model with an
independent pitch encoder and phoneme encoder, which disentangles the phoneme
and pitch information in the music score so that the corpus is used more fully.
Specifically, following equal temperament theory, the pitch encoder is
constrained by a pitch metric loss that maps the distances between adjacent
input pitches into the corresponding frequency multiples between the encoder
outputs. For the phoneme encoder, based on the observation that the same
phoneme sung at different pitches produces a similar pronunciation, the encoder
is followed by an adversarially trained pitch classifier that forces identical
phonemes with different pitches to map into the same phoneme feature space. In
this way, the sparse phonemes and pitches of the original input spaces are
transformed into more compact feature spaces, where identical elements cluster
closely and cooperate to enhance synthesis quality. The outputs of the two
encoders are then summed and passed through the decoder of the acoustic model.
Experimental results indicate that the proposed approaches characterize the
intrinsic structure among pitch inputs, obtain better pitch synthesis accuracy,
and achieve superior singing synthesis performance compared with an advanced
baseline system.
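The pitch metric loss is described above only at a high level. The sketch below is a minimal, hypothetical PyTorch reading of that constraint, assuming the distance between two pitch-encoder outputs is pushed towards the equal-temperament frequency multiple 2^((p_i - p_j)/12) of the corresponding MIDI pitches; the paper's exact formulation may differ, and the class name, tensor shapes, and the "ratio minus one" target are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PitchMetricLoss(nn.Module):
    """Hypothetical pitch metric loss: the distance between the embeddings of
    two adjacent input pitches is pushed towards their equal-temperament
    frequency multiple 2^(|p_i - p_j| / 12) (minus one, so that identical
    pitches target a distance of zero)."""

    def forward(self, pitch_emb: torch.Tensor, midi_pitch: torch.Tensor) -> torch.Tensor:
        # pitch_emb:  (B, T, D) pitch-encoder outputs
        # midi_pitch: (B, T)    MIDI note numbers from the score
        emb_dist = torch.norm(pitch_emb[:, 1:] - pitch_emb[:, :-1], dim=-1)   # (B, T-1)
        semitones = (midi_pitch[:, 1:] - midi_pitch[:, :-1]).abs().float()    # (B, T-1)
        freq_multiple = torch.pow(2.0, semitones / 12.0)                      # equal temperament
        return F.mse_loss(emb_dist, freq_multiple - 1.0)
```

In training, a term like this would typically be added to the spectrogram reconstruction loss with a weighting factor, so the pitch encoder learns an embedding space whose geometry mirrors the frequency relations of the input pitches.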
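The adversarially trained pitch classifier attached to the phoneme encoder can be realized in several ways; a common choice for this kind of feature disentanglement is a gradient reversal layer, sketched below under that assumption (the paper may instead use an alternating min-max scheme). All class names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialPitchClassifier(nn.Module):
    """Hypothetical classifier that predicts the pitch from phoneme-encoder
    features; the reversed gradient discourages the phoneme encoder from
    retaining pitch information."""

    def __init__(self, feat_dim: int = 256, num_pitches: int = 128, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_pitches),
        )

    def forward(self, phoneme_feat: torch.Tensor) -> torch.Tensor:
        # phoneme_feat: (B, T, D) phoneme-encoder outputs
        reversed_feat = GradReverse.apply(phoneme_feat, self.lambd)
        return self.net(reversed_feat)  # (B, T, num_pitches) pitch logits
```

Training this classifier with cross-entropy against the ground-truth pitch labels, while the reversed gradient flows into the phoneme encoder, pushes identical phonemes sung at different pitches towards the same region of the phoneme feature space; the pitch and phoneme encoder outputs are then summed and passed to the decoder, as described above.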
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder, with the help of the VQ-VAE, to convert the text features extracted by the text encoder into a mel-spectrogram, and then the vocoder transforms the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z) - Voice Activity Detection for Transient Noisy Environment Based on
Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments containing transient and stationary noise.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z) - From Note-Level to Chord-Level Neural Network Models for Voice
Separation in Symbolic Music [0.0]
We train neural networks that assign notes to voices either separately for each note in a chord (note-level) or jointly for all notes in a chord (chord-level).
Both models surpass a strong baseline based on an iterative application of an envelope extraction function.
The two models are also shown to outperform previous approaches on separating the voices in Bach music.
arXiv Detail & Related papers (2020-11-05T18:39:42Z) - Semi-supervised Learning for Singing Synthesis Timbre [22.75251024528604]
We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only.
Our system is an encoder-decoder model with two encoders, linguistic and acoustic, and one (acoustic) decoder.
We evaluate our system with a listening test and show that the results are comparable to those obtained with an equivalent supervised approach.
arXiv Detail & Related papers (2020-11-05T13:33:34Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)