Speech Modeling with a Hierarchical Transformer Dynamical VAE
- URL: http://arxiv.org/abs/2303.09404v2
- Date: Wed, 10 May 2023 13:55:59 GMT
- Title: Speech Modeling with a Hierarchical Transformer Dynamical VAE
- Authors: Xiaoyu Lin, Xiaoyu Bie, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda
- Abstract summary: We propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE).
We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling, while enabling a simpler training procedure.
- Score: 23.847366888695266
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The dynamical variational autoencoders (DVAEs) are a family of
latent-variable deep generative models that extends the VAE to model a sequence
of observed data and a corresponding sequence of latent vectors. In almost all
the DVAEs of the literature, the temporal dependencies within each sequence and
across the two sequences are modeled with recurrent neural networks. In this
paper, we propose to model speech signals with the Hierarchical Transformer
DVAE (HiT-DVAE), which is a DVAE with two levels of latent variable
(sequence-wise and frame-wise) and in which the temporal dependencies are
implemented with the Transformer architecture. We show that HiT-DVAE
outperforms several other DVAEs for speech spectrogram modeling, while enabling
a simpler training procedure, revealing its high potential for downstream
low-level speech processing tasks such as speech enhancement.
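To make the two-level latent structure concrete, here is a minimal PyTorch sketch of a HiT-DVAE-style model: a single sequence-level latent w, frame-level latents z_1:T, and Transformer layers carrying the temporal dependencies. Module names, dimensions, and the pooling scheme are illustrative assumptions, not the authors' implementation, and the KL terms of the training objective are omitted.

```python
# Hypothetical sketch of a HiT-DVAE-style model: a sequence-level latent w,
# frame-level latents z_1:T, and Transformer layers for temporal
# dependencies. Names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn

class HiTDVAESketch(nn.Module):
    def __init__(self, x_dim=513, z_dim=16, w_dim=32, d_model=256):
        super().__init__()
        self.embed_x = nn.Linear(x_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Sequence-level latent w: one Gaussian per utterance (mean-pooled).
        self.w_params = nn.Linear(d_model, 2 * w_dim)
        # Frame-level latents z_t: one Gaussian per time frame.
        self.z_params = nn.Linear(d_model, 2 * z_dim)
        # Decoder maps (z_t, w) back to a spectrogram frame.
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + w_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, x_dim),
        )

    @staticmethod
    def reparameterize(params):
        mean, logvar = params.chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

    def forward(self, x):                      # x: (batch, T, x_dim)
        h = self.encoder(self.embed_x(x))      # (batch, T, d_model)
        w = self.reparameterize(self.w_params(h.mean(dim=1)))  # (batch, w_dim)
        z = self.reparameterize(self.z_params(h))              # (batch, T, z_dim)
        w_seq = w.unsqueeze(1).expand(-1, z.size(1), -1)
        return self.decoder(torch.cat([z, w_seq], dim=-1))     # reconstruction

x = torch.randn(8, 100, 513)                   # batch of log-spectrograms
print(HiTDVAESketch()(x).shape)                # torch.Size([8, 100, 513])
```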
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
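As a rough illustration of the idea summarized above, the sketch below feeds recurrent, order-aware embeddings (here a single-scale GRU rather than a pyramidal one) into a standard Transformer encoder, so no positional embedding is added. All names and shapes are assumptions for illustration.

```python
# Sketch of the idea: replace additive positional embeddings with recurrent,
# order-aware embeddings computed before a standard Transformer encoder.
# Single-scale GRU here, not the paper's pyramidal design; illustrative only.
import torch
import torch.nn as nn

class RecurrentEmbeddingTransformer(nn.Module):
    def __init__(self, n_vars=7, d_model=128):
        super().__init__()
        # The GRU hidden state carries temporal order, so no positional
        # embedding is needed downstream.
        self.gru = nn.GRU(n_vars, d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_vars)

    def forward(self, x):            # x: (batch, time, n_vars)
        h, _ = self.gru(x)           # order-aware embeddings
        return self.head(self.encoder(h)[:, -1])  # forecast the next step

x = torch.randn(8, 96, 7)            # 96 past steps of 7 variables
print(RecurrentEmbeddingTransformer()(x).shape)  # (8, 7)
```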
- GIVT: Generative Infinite-Vocabulary Transformers [18.55070896912795]
We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate vector sequences with real-valued entries.
Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, we use GIVT to model the unquantized real-valued latent sequences of a $\beta$-VAE.
In class-conditional image generation, GIVT outperforms VQ-GAN as well as MaskGIT, and achieves performance competitive with recent latent diffusion models.
arXiv Detail & Related papers (2023-12-04T18:48:02Z)
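The key mechanism summarized above is replacing the softmax over a finite vocabulary with a continuous output distribution over real-valued vectors. Below is a minimal sketch using a single diagonal Gaussian head (the paper uses richer mixture outputs); all names are illustrative assumptions.

```python
# Sketch of a continuous output head: the Transformer predicts real-valued
# latent vectors directly instead of scoring a finite vocabulary.
# Single diagonal Gaussian for brevity; names are illustrative.
import torch
import torch.nn as nn

class ContinuousHead(nn.Module):
    def __init__(self, d_model=256, latent_dim=32):
        super().__init__()
        self.proj = nn.Linear(d_model, 2 * latent_dim)

    def nll(self, h, target):
        # Gaussian negative log-likelihood of the next real-valued latent
        # vector, up to an additive constant.
        mean, logvar = self.proj(h).chunk(2, dim=-1)
        return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).sum(-1).mean()

head = ContinuousHead()
h = torch.randn(8, 100, 256)        # Transformer decoder hidden states
target = torch.randn(8, 100, 32)    # unquantized VAE latents to predict
print(head.nll(h, target))          # scalar training loss
```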
- Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders [29.796695365217893]
Dynamical variational auto-encoders (DVAEs) are a class of deep generative models with latent variables.
We propose an unsupervised speech enhancement algorithm based on the most general form of DVAEs.
We derive a variational expectation-maximization algorithm to perform speech enhancement.
arXiv Detail & Related papers (2021-06-23T09:48:38Z)
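In equation form, a hedged sketch of this setup: a pretrained DVAE acts as a prior over clean speech, and a variational EM alternates between inferring the clean speech and updating the noise-model parameters. The generic factorization below is an assumption about the form; the exact model is in the cited paper.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\begin{document}
% Generic form: observed noisy speech x, clean speech s, DVAE latents z.
\begin{align*}
  p(x_{1:T}, s_{1:T}, z_{1:T})
    &= p_\varphi(x_{1:T} \mid s_{1:T})\; p_\theta(s_{1:T}, z_{1:T})
       && \text{(noise model $\times$ DVAE prior)}\\
  \text{E-step:}\quad
    & q(s_{1:T}, z_{1:T}) \approx p(s_{1:T}, z_{1:T} \mid x_{1:T})\\
  \text{M-step:}\quad
    & \varphi \leftarrow \arg\max_{\varphi}\;
      \mathbb{E}_{q}\big[\log p_\varphi(x_{1:T} \mid s_{1:T})\big]
\end{align*}
\end{document}
```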
- Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationality of Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
- Dynamical Variational Autoencoders: A Comprehensive Review [23.25573952809074]
We introduce and discuss a general class of models, called dynamical variational autoencoders (DVAEs).
We present in detail seven recently proposed DVAE models, with an aim to homogenize the notations and presentation lines.
We have reimplemented those seven DVAE models and present the results of an experimental benchmark conducted on the speech analysis-resynthesis task.
arXiv Detail & Related papers (2020-08-28T11:49:33Z)
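For reference, the general DVAE factorization this family builds on, in which each observation and each latent can depend on all past observations and latents; specific models are recovered by dropping some of these dependencies.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Most general DVAE factorization; specific DVAE models are obtained by
% removing some of the conditioning dependencies.
\begin{equation*}
  p_\theta(x_{1:T}, z_{1:T})
  = \prod_{t=1}^{T}
    p_\theta\!\left(x_t \mid x_{1:t-1},\, z_{1:t}\right)\,
    p_\theta\!\left(z_t \mid x_{1:t-1},\, z_{1:t-1}\right)
\end{equation*}
\end{document}
```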
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components; a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Variational Transformers for Diverse Response Generation [71.53159402053392]
Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
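A minimal sketch of the first variant (a single discourse-level latent conditioning the decoder): the latent z is broadcast and added to every position's embedding before causal self-attention. Sampling z from a CVAE prior/posterior and the KL term are omitted; all names are illustrative assumptions.

```python
# Sketch of a Transformer decoder conditioned on one global latent z,
# added to every input embedding. CVAE machinery omitted; names illustrative.
import torch
import torch.nn as nn

class GlobalLatentDecoder(nn.Module):
    def __init__(self, vocab=1000, d_model=256, z_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.z_to_h = nn.Linear(z_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, tokens, z):
        # Broadcast the discourse-level latent over every position.
        h = self.embed(tokens) + self.z_to_h(z).unsqueeze(1)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.out(self.layers(h, mask=mask))  # next-token logits

tokens = torch.randint(0, 1000, (4, 20))
z = torch.randn(4, 64)                         # sampled from the CVAE prior
print(GlobalLatentDecoder()(tokens, z).shape)  # (4, 20, 1000)
```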
This list is automatically generated from the titles and abstracts of the papers on this site.