Learning and controlling the source-filter representation of speech with a variational autoencoder
- URL: http://arxiv.org/abs/2204.07075v3
- Date: Tue, 21 Mar 2023 10:41:12 GMT
- Title: Learning and controlling the source-filter representation of speech with a variational autoencoder
- Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier
- Abstract summary: In speech processing, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors.
We propose a method to accurately and independently control the source-filter speech factors within the latent subspaces.
Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms.
- Score: 23.05989605017053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and controlling latent representations in deep generative
models is a challenging yet important problem for analyzing, transforming and
generating various types of data. In speech processing, inspired by the
anatomical mechanisms of phonation, the source-filter model considers that
speech signals are produced from a few independent and physically meaningful
continuous latent factors, among which the fundamental frequency $f_0$ and the
formants are of primary importance. In this work, we start from a variational
autoencoder (VAE) trained in an unsupervised manner on a large dataset of
unlabeled natural speech signals, and we show that the source-filter model of
speech production naturally arises as orthogonal subspaces of the VAE latent
space. Using only a few seconds of labeled speech signals generated with an
artificial speech synthesizer, we propose a method to identify the latent
subspaces encoding $f_0$ and the first three formant frequencies, we show that
these subspaces are orthogonal, and based on this orthogonality, we develop a
method to accurately and independently control the source-filter speech factors
within the latent subspaces. Without requiring additional information such as
text or human-labeled data, this results in a deep generative model of speech
spectrograms that is conditioned on $f_0$ and the formant frequencies, and
which is applied to the transformation of speech signals. Finally, we also propose
a robust $f_0$ estimation method that exploits the projection of a speech
signal onto the learned latent subspace associated with $f_0$.
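The pipeline described in the abstract can be pictured concretely: encode a few seconds of synthesizer-generated speech with known $f_0$, estimate a low-dimensional subspace of the VAE latent space that captures the $f_0$ variation, and then estimate or modify $f_0$ by projecting latent vectors onto that subspace while leaving its orthogonal complement unchanged. The sketch below is only an illustration of that geometry, not the authors' implementation; the function names, the PCA-based subspace fit, and the nearest-neighbour mapping between subspace coordinates and $f_0$ values are assumptions.

```python
import numpy as np

# Hypothetical inputs (not from the paper's code):
#   z_synth : (N, D) VAE latent mean vectors of synthetic speech frames
#   f0_synth: (N,)   known f0 value (Hz) of each synthetic frame
#   z_test  : (M, D) latent trajectory of a natural utterance

def fit_subspace(z_synth, n_dims=3):
    """PCA-style estimate of the latent subspace spanned by f0 variation."""
    mean = z_synth.mean(axis=0)
    _, _, vt = np.linalg.svd(z_synth - mean, full_matrices=False)
    return mean, vt[:n_dims]                      # orthonormal rows, (n_dims, D)

def project(z, mean, basis):
    """Coordinates of latent vectors inside the learned subspace."""
    return (z - mean) @ basis.T

def estimate_f0(z_test, mean, basis, z_synth, f0_synth):
    """Nearest-neighbour f0 estimate in the subspace (illustration only)."""
    ref = project(z_synth, mean, basis)           # labeled anchor points
    query = project(z_test, mean, basis)
    dist = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)
    return f0_synth[dist.argmin(axis=1)]

def move_in_subspace(z, mean, basis, target_coords):
    """Replace the component of z inside the subspace while leaving the
    orthogonal complement (e.g. formant-related directions) untouched."""
    inside = project(z, mean, basis) @ basis      # current in-subspace part
    outside = (z - mean) - inside
    return mean + outside + target_coords @ basis
```

In the paper, separate subspaces are identified for $f_0$ and each of the first three formants, and their orthogonality is what allows one factor to be moved while the others stay fixed; the same geometric argument underlies the proposed $f_0$ estimator.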
Related papers
- VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space [0.49109372384514843]
VQalAttent is a lightweight model designed to generate fake speech with tunable performance and interpretability.
Our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources.
arXiv Detail & Related papers (2024-11-22T00:21:39Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results show good performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z) - Interpretable Acoustic Representation Learning on Breathing and Speech
Signals for COVID-19 Detection [37.01066509527848]
We describe an approach for representation learning of audio signals for the task of COVID-19 detection.
The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions.
The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism.
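This front-end can be sketched in a few lines. The parameterization below (a Gaussian envelope modulated by a cosine at a centre frequency, followed by frame-wise pooling and log-compression) is a generic reading of "cosine modulated Gaussian" filters, not that paper's code; all names, kernel lengths, and frame sizes are assumptions, and the self-attention relevance weighting is not shown.

```python
import numpy as np

def cos_gauss_kernels(center_freqs_hz, bandwidths_hz, sr=16000, width=401):
    """Bank of 1-D kernels: a Gaussian envelope modulated by a cosine.
    Illustrative parameterization only."""
    t = (np.arange(width) - width // 2) / sr           # time axis in seconds
    kernels = []
    for fc, bw in zip(center_freqs_hz, bandwidths_hz):
        envelope = np.exp(-0.5 * (t * bw) ** 2)        # Gaussian envelope
        kernels.append(envelope * np.cos(2 * np.pi * fc * t))
    return np.stack(kernels)                           # (n_filters, width)

def filterbank_features(x, kernels, frame=400, hop=160):
    """Convolve, average-pool per frame, and log-compress (illustration)."""
    outputs = np.stack([np.convolve(x, k, mode="same") for k in kernels])
    n_frames = 1 + (len(x) - frame) // hop
    pooled = np.stack([(outputs[:, i * hop:i * hop + frame] ** 2).mean(axis=1)
                       for i in range(n_frames)], axis=1)
    return np.log(pooled + 1e-6)                       # (n_filters, n_frames)
```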
arXiv Detail & Related papers (2022-06-27T15:20:51Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z) - Nonlinear ISA with Auxiliary Variables for Learning Speech
Representations [51.9516685516144]
We introduce a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.
We propose an algorithm that learns unsupervised speech representations whose subspaces are independent.
arXiv Detail & Related papers (2020-07-25T14:53:09Z) - Speech-to-Singing Conversion based on Boundary Equilibrium GAN [42.739822506085694]
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one.
The proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
arXiv Detail & Related papers (2020-05-28T08:18:02Z) - Cross-modal variational inference for bijective signal-symbol
translation [11.444576186559486]
In this paper, we propose an approach for signal/symbol translation by turning this problem into a density estimation task.
We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match with an additive constraint.
In this article, we test our models on pitch, octave and dynamics symbols, which comprise a fundamental step towards music transcription and label-constrained audio generation.
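The "additive constraint" mentioned here can be read as a penalty that ties the two VAEs' latent codes together. The objective below is one plausible form of such a coupling, written with hypothetical notation (per-domain ELBO terms, encoder means $\mu_s$ and $\mu_y$, and a weight $\lambda$); it is not that paper's exact loss.

$$\mathcal{L}(x_s, x_y) = \mathrm{ELBO}_{\text{signal}}(x_s) + \mathrm{ELBO}_{\text{symbol}}(x_y) - \lambda \,\big\lVert \mu_s(x_s) - \mu_y(x_y) \big\rVert_2^2$$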
arXiv Detail & Related papers (2020-02-10T15:25:48Z)