SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech
Translation
- URL: http://arxiv.org/abs/2002.12231v1
- Date: Thu, 27 Feb 2020 16:22:42 GMT
- Title: SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech
Translation
- Authors: Arya D. McCarthy and Liezl Puzon and Juan Pino
- Abstract summary: We propose autoencoding speaker conversion for training data augmentation in automatic speech translation.
This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice.
Our method compares favorably to SpecAugment on English$\to$French and English$\to$Romanian automatic speech translation (AST) tasks.
- Score: 12.292167129361825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose autoencoding speaker conversion for training data augmentation in
automatic speech translation. This technique directly transforms an audio
sequence, resulting in audio synthesized to resemble another speaker's voice.
Our method compares favorably to SpecAugment on English$\to$French and
English$\to$Romanian automatic speech translation (AST) tasks as well as on a
low-resource English automatic speech recognition (ASR) task. Further, in
ablations, we show the benefits of both quantity and diversity in augmented
data. Finally, we show that we can combine our approach with augmentation by
machine-translated transcripts to obtain a competitive end-to-end AST model
that outperforms a very strong cascade model on an English$\to$French AST task.
Our method is sufficiently general that it can be applied to other speech
generation and analysis tasks.
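To make the augmentation recipe concrete, the sketch below shows how an autoencoder-style speaker conversion model can multiply an AST corpus: each source utterance is re-synthesized in several other voices while its translation stays fixed. This is a minimal illustration under assumed names (SpeakerConversionAE, augment_corpus, and the GRU encoder/decoder are hypothetical stand-ins), not the authors' implementation.

```python
# Minimal sketch, not the paper's code: a voice-conversion autoencoder that
# separates content from speaker identity, plus an augmentation loop that
# pairs each converted utterance with the original translation.
import random
import torch
import torch.nn as nn

class SpeakerConversionAE(nn.Module):
    """Hypothetical conversion model: the content bottleneck is assumed to
    discard speaker identity, so the decoder must recover it from the
    target speaker embedding."""
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=128):
        super().__init__()
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)

    def forward(self, mels, speaker_emb):
        content, _ = self.content_encoder(mels)              # (B, T, content_dim)
        spk = speaker_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        converted, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return converted                                     # (B, T, n_mels)

def augment_corpus(corpus, model, speaker_embs, copies_per_utt=3):
    """corpus: list of (mel spectrogram, target-language translation) pairs.
    Adds speaker-converted copies; only the voice varies, never the text."""
    augmented = list(corpus)
    for mels, translation in corpus:
        for spk in random.sample(speaker_embs, copies_per_utt):
            with torch.no_grad():
                converted = model(mels.unsqueeze(0), spk.unsqueeze(0))[0]
            augmented.append((converted, translation))
    return augmented
```

Sampling several target speakers per utterance is what supplies the diversity that the ablations find beneficial alongside sheer quantity of augmented data.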
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
Our approach was declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training (a rough sketch of the unit-induction step appears after this list).
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
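As flagged in the Wav2Seq entry above, here is a rough sketch of how a pseudo language can be induced from speech by discretizing frame-level features. The clustering setup and the induce_pseudo_language helper are illustrative assumptions, not Wav2Seq's released pipeline (which further compacts the unit sequences, e.g. with subword modeling).

```python
# Rough sketch, under assumptions: induce discrete "pseudo-language" targets
# by k-means clustering of frame-level speech features, then collapsing
# repeated units. The resulting sequences serve as ASR-style targets for a
# self-supervised pseudo speech recognition task.
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_language(feature_seqs, n_units=500):
    """feature_seqs: list of (T_i, D) arrays of frame-level features from
    any pretrained acoustic encoder (the feature extractor is assumed)."""
    kmeans = KMeans(n_clusters=n_units, n_init=4, random_state=0)
    kmeans.fit(np.concatenate(feature_seqs, axis=0))
    targets = []
    for feats in feature_seqs:
        units = kmeans.predict(feats)                 # frame -> cluster id
        deduped = [int(u) for i, u in enumerate(units)
                   if i == 0 or u != units[i - 1]]    # collapse repeats
        targets.append(deduped)
    return targets  # pseudo-token sequences: transcripts without any text
```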