Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis:
ZeroSpeech 2020 Challenge
- URL: http://arxiv.org/abs/2005.11676v1
- Date: Sun, 24 May 2020 07:42:43 GMT
- Title: Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis:
ZeroSpeech 2020 Challenge
- Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
- Abstract summary: The ZeroSpeech 2020 challenge asks participants to build a speech synthesizer without any textual information or phonetic labels.
We build a system that addresses two major components: 1) given speech audio, extracting subword units in an unsupervised way, and 2) re-synthesizing the audio from novel speakers.
Our main contribution is a Transformer-based VQ-VAE for unsupervised unit discovery and a Transformer-based inverter for speech synthesis given the extracted codebook.
- Score: 27.314082075933197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we report our submitted system for the ZeroSpeech 2020
challenge on Track 2019. The main theme of this challenge is to build a speech
synthesizer without any textual information or phonetic labels. To tackle this,
we build a system that addresses two major components: 1) given speech audio,
extracting subword units in an unsupervised way, and 2) re-synthesizing the
audio from novel speakers. The system also needs to balance codebook
performance between the ABX error rate and the bitrate compression rate. Our
main contributions are a Transformer-based VQ-VAE for unsupervised unit
discovery and a Transformer-based inverter for speech synthesis given the
extracted codebook. Additionally, we explored several regularization methods to
improve performance further.
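To make the unit-discovery component more concrete, the sketch below shows a standard VQ-VAE quantization step with a straight-through gradient estimator, together with an entropy-based bitrate estimate of the kind the challenge trades off against the ABX error rate. This is a minimal PyTorch-style illustration under assumed names, shapes, and hyper-parameters; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: (T, D) encoder frames; codebook: (K, D) learnable code vectors."""
    # Squared Euclidean distance from every frame to every code vector.
    dist = (z_e.pow(2).sum(1, keepdim=True)
            - 2 * z_e @ codebook.t()
            + codebook.pow(2).sum(1))
    units = dist.argmin(dim=1)            # discrete unit index per frame
    z_q = codebook[units]                 # quantized frames
    z_q_st = z_e + (z_q - z_e).detach()   # straight-through estimator for the decoder
    # Standard VQ-VAE auxiliary losses: codebook term plus commitment term.
    vq_loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_e, z_q.detach())
    return z_q_st, units, vq_loss

def bitrate_estimate(units, duration_sec):
    """Entropy-based bitrate (bits/second) of a discrete unit sequence."""
    counts = torch.bincount(units).float()
    probs = counts[counts > 0] / units.numel()
    entropy_bits = -(probs * probs.log2()).sum().item()
    return units.numel() * entropy_bits / duration_sec

# Example: 100 encoder frames of dimension 64, a 256-entry codebook, ~1 s of audio.
z_e = torch.randn(100, 64, requires_grad=True)
codebook = torch.nn.Parameter(torch.randn(256, 64))
z_q, units, vq_loss = vector_quantize(z_e, codebook)
print(bitrate_estimate(units, duration_sec=1.0))
```

A lower bitrate (fewer, more predictable units) generally trades off against ABX discriminability, which is the balance the abstract refers to.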
Related papers
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers [119.89284877061779]
This paper introduces VALL-E 2, the latest advancement in neural language models that marks a milestone in zero-shot text-to-speech (TTS).
VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.
The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis.
arXiv Detail & Related papers (2024-06-08T06:31:03Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- The FruitShell French synthesis system at the Blizzard 2023 Challenge [12.459890525109646]
This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023.
The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals.
arXiv Detail & Related papers (2023-09-01T02:56:20Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges Award at Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- Non-autoregressive sequence-to-sequence voice conversion [47.521186595305984]
This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.
We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence.
arXiv Detail & Related papers (2021-04-14T11:53:51Z)
- The NU Voice Conversion System for the Voice Conversion Challenge 2020: On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural Vocoders [42.636504426142906]
We present the voice conversion systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC 2020).
We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders.
arXiv Detail & Related papers (2020-10-09T09:19:37Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge [26.114011076658237]
We propose two neural models to tackle the problem of learning discrete representations of speech.
The first model is a type of vector-quantized variational autoencoder (VQ-VAE).
The second model combines vector quantization with contrastive predictive coding (VQ-CPC).
We evaluate the models on English and Indonesian data for the ZeroSpeech 2020 challenge.
arXiv Detail & Related papers (2020-05-19T13:06:17Z)
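The last entry above combines vector quantization with contrastive predictive coding (VQ-CPC). As a rough illustration of that idea, the sketch below scores a context vector against quantized future frames with an InfoNCE-style loss; all names, shapes, and the in-batch negative scheme are assumptions for illustration, not the cited paper's code.

```python
import torch
import torch.nn.functional as F

def info_nce(context, future_codes, predictors):
    """context: (B, D) summary vectors; future_codes: (B, S, D) quantized
    frames at the next S steps; predictors: list of S linear layers."""
    B, S, D = future_codes.shape
    loss = 0.0
    for k in range(S):
        pred = predictors[k](context)       # (B, D) prediction for step k
        targets = future_codes[:, k, :]     # (B, D) positives
        # Score each prediction against every target in the batch;
        # the diagonal entries are the positive pairs.
        logits = pred @ targets.t()         # (B, B)
        labels = torch.arange(B)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / S

# Example usage with random tensors standing in for encoder / codebook outputs.
B, S, D = 8, 4, 64
context = torch.randn(B, D)
future_codes = torch.randn(B, S, D)          # would come from the VQ codebook
predictors = [torch.nn.Linear(D, D) for _ in range(S)]
print(info_nce(context, future_codes, predictors).item())
```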