Unsupervised Cross-Domain Speech-to-Speech Conversion with
Time-Frequency Consistency
- URL: http://arxiv.org/abs/2005.07810v2
- Date: Tue, 19 May 2020 01:16:49 GMT
- Title: Unsupervised Cross-Domain Speech-to-Speech Conversion with
Time-Frequency Consistency
- Authors: Mohammad Asif Khan, Fabien Cardinaux, Stefan Uhlich, Marc Ferras, Asja
Fischer
- Abstract summary: We propose a condition encouraging spectrogram consistency during the adversarial training procedure.
Our experimental results on the Librispeech corpus show that the model trained with the TF-consistency condition yields perceptually better speech-to-speech conversion.
- Score: 14.062850439230111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, generative adversarial network (GAN)-based models
have been successfully applied for unsupervised speech-to-speech conversion. The rich
compact harmonic view of the magnitude spectrogram is considered a suitable
choice for training these models with audio data. To reconstruct the speech
signal, a magnitude spectrogram is first generated by the neural network and
then used by methods such as the Griffin-Lim algorithm to recover a phase
spectrogram. This procedure has the drawback that the generated magnitude
spectrogram may not be consistent; consistency is required for finding a phase
such that the full spectrogram corresponds to a natural-sounding speech waveform. In
this work, we approach this problem by proposing a condition encouraging
spectrogram consistency during the adversarial training procedure. We
demonstrate our approach on the task of translating the voice of a male speaker
to that of a female speaker, and vice versa. Our experimental results on the
Librispeech corpus show that the model trained with the TF-consistency
condition yields perceptually better speech-to-speech conversion.
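The consistency condition can be made concrete with a small numerical check: a magnitude spectrogram is (approximately) consistent if re-analysing the waveform obtained from it changes the magnitudes only slightly. The sketch below uses librosa with illustrative STFT settings, not the paper's actual loss or training code; it reconstructs a waveform with Griffin-Lim and measures the resulting time-frequency mismatch.

```python
import numpy as np
import librosa

# Load any speech waveform; "speech.wav" is a placeholder path, and the STFT
# settings below are illustrative choices, not the paper's configuration.
y, sr = librosa.load("speech.wav", sr=16000)
n_fft, hop = 1024, 256
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

# Griffin-Lim iteratively estimates a phase for the magnitude spectrogram
# (here the ground-truth magnitudes stand in for a generated spectrogram)
# and returns the corresponding waveform.
y_hat = librosa.griffinlim(S, n_iter=60, n_fft=n_fft, hop_length=hop)

# Time-frequency consistency check: re-analyse the reconstructed waveform and
# compare its magnitude spectrogram with the one Griffin-Lim started from.
# An inconsistent (e.g. freely generated) spectrogram yields a large error.
S_hat = np.abs(librosa.stft(y_hat, n_fft=n_fft, hop_length=hop))
T = min(S.shape[1], S_hat.shape[1])
consistency_error = np.mean((S[:, :T] - S_hat[:, :T]) ** 2)
print(f"mean squared TF-consistency error: {consistency_error:.6f}")
```

During adversarial training, a penalty of this form (an assumption about how such a condition could be implemented, using differentiable STFT/ISTFT operations) would push the generator toward magnitude spectrograms for which a matching phase actually exists.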
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
Experiments achieved a correlation of 0.675 with ground-truth tract variables.
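As a rough illustration of this inversion setup (layer sizes, feature dimension, and the data below are assumptions, not the authors' configuration), a feed-forward regressor with six outputs can be paired with the per-variable Pearson correlation used to score predicted trajectories:

```python
import torch
import torch.nn as nn

# Illustrative feed-forward regressor mapping acoustic features to six
# tract-variable trajectories; sizes are assumptions, not the paper's.
class ArticulatoryInverter(nn.Module):
    def __init__(self, n_acoustic=40, n_tract_vars=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_acoustic, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tract_vars),
        )

    def forward(self, x):            # x: (frames, n_acoustic)
        return self.net(x)           # (frames, n_tract_vars)

def pearson_correlation(pred, target):
    """Per-variable Pearson correlation between predicted and ground-truth trajectories."""
    pred = pred - pred.mean(dim=0)
    target = target - target.mean(dim=0)
    num = (pred * target).sum(dim=0)
    den = pred.norm(dim=0) * target.norm(dim=0) + 1e-8
    return num / den

# Toy usage with random tensors standing in for real acoustic/articulatory pairs.
model = ArticulatoryInverter()
acoustic = torch.randn(500, 40)
tract_vars = torch.randn(500, 6)
print(pearson_correlation(model(acoustic), tract_vars))
```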
arXiv Detail & Related papers (2022-03-11T07:27:42Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech can be re-synthesized by feeding the predicted symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Learning robust speech representation with an articulatory-regularized variational autoencoder [13.541055956177937]
We develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features.
We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
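One plausible way to realise such an articulatory regulariser (a sketch under assumptions; the summary does not give the actual model details) is to add an auxiliary head that predicts articulatory parameters from the VAE latent code, alongside the usual reconstruction and KL terms:

```python
import torch
import torch.nn as nn

class ArticulatoryRegularizedVAE(nn.Module):
    """Toy VAE over spectral frames with an articulatory prediction head.
    Dimensions and loss weighting are illustrative assumptions."""
    def __init__(self, n_spec=80, n_latent=16, n_artic=10, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_spec, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, n_latent)
        self.to_logvar = nn.Linear(hidden, n_latent)
        self.decoder = nn.Sequential(nn.Linear(n_latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_spec))
        # Head mapping the latent code to articulatory parameters
        # (jaw, tongue, lips and velum descriptors in the paper's setting).
        self.artic_head = nn.Linear(n_latent, n_artic)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), self.artic_head(z), mu, logvar

def loss_fn(x, artic, model, lam=1.0):
    x_rec, artic_pred, mu, logvar = model(x)
    rec = nn.functional.mse_loss(x_rec, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # The articulatory constraint: latent codes must also explain articulation.
    artic_loss = nn.functional.mse_loss(artic_pred, artic)
    return rec + kl + lam * artic_loss
```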
arXiv Detail & Related papers (2021-04-07T15:47:04Z)
- Multi-Discriminator Sobolev Defense-GAN Against Adversarial Attacks for End-to-End Speech Systems [78.5097679815944]
This paper introduces a defense approach against end-to-end adversarial attacks developed for cutting-edge speech-to-text systems.
First, we represent speech signals with 2D spectrograms using the short-time Fourier transform.
Second, we iteratively find a safe vector using a spectrogram subspace projection operation.
Third, we synthesize a spectrogram from this safe vector using a novel GAN architecture trained with the Sobolev integral probability metric.
arXiv Detail & Related papers (2021-03-15T01:11:13Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN [42.739822506085694]
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one.
The proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
arXiv Detail & Related papers (2020-05-28T08:18:02Z)