fairseq S2T: Fast Speech-to-Text Modeling with fairseq
- URL: http://arxiv.org/abs/2010.05171v2
- Date: Tue, 14 Jun 2022 14:16:52 GMT
- Title: fairseq S2T: Fast Speech-to-Text Modeling with fairseq
- Authors: Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro
Okhonko, Juan Pino
- Abstract summary: Fairseq S2T is a fairseq extension for speech-to-text (S2T) modeling tasks.
It provides end-to-end workflows from data pre-processing to offline (online) inference.
- Score: 36.728277923926875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T)
modeling tasks such as end-to-end speech recognition and speech-to-text
translation. It follows fairseq's careful design for scalability and
extensibility. We provide end-to-end workflows from data pre-processing, model
training to offline (online) inference. We implement state-of-the-art
RNN-based, Transformer-based as well as Conformer-based models and open-source
detailed training recipes. Fairseq's machine translation models and language
models can be seamlessly integrated into S2T workflows for multi-task learning
or transfer learning. Fairseq S2T documentation and examples are available at
https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.
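To make the workflow above concrete, here is a minimal sketch that drives the documented fairseq-train and fairseq-generate entry points from Python. It assumes fairseq is installed and that features, vocabulary, and config.yaml have already been produced by the preprocessing scripts in the linked examples; the data paths, subset names, and hyperparameters below are illustrative placeholders, not canonical recipe values.

```python
# Minimal sketch of the fairseq S2T train/infer workflow (ASR-style setup).
# Assumes `pip install fairseq` and data prepared per the speech_to_text
# examples; all paths, subsets, and hyperparameters are placeholders.
import subprocess

DATA = "data/librispeech"   # root produced by the example preprocessing scripts
SAVE = "checkpoints/s2t"

# 1. Train a small S2T Transformer with the speech_to_text task.
subprocess.run([
    "fairseq-train", DATA,
    "--config-yaml", "config.yaml",          # feature/vocab config from preprocessing
    "--train-subset", "train", "--valid-subset", "dev",
    "--task", "speech_to_text",
    "--arch", "s2t_transformer_s",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
    "--optimizer", "adam", "--lr", "2e-3",
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "10000",
    "--max-tokens", "40000", "--max-update", "300000",
    "--save-dir", SAVE,
], check=True)

# 2. Offline inference: beam search over a held-out subset, scored with WER.
subprocess.run([
    "fairseq-generate", DATA,
    "--config-yaml", "config.yaml",
    "--gen-subset", "test",
    "--task", "speech_to_text",
    "--path", f"{SAVE}/checkpoint_best.pt",
    "--max-tokens", "50000", "--beam", "5",
    "--scoring", "wer",
], check=True)
```

For the transfer-learning integration mentioned in the abstract, the s2t_transformer architectures also expose flags such as --load-pretrained-encoder-from (per the fairseq source) to initialize the speech encoder from an existing checkpoint.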
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
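As a toy illustration of this two-pass decomposition (not ComSpeech's actual code; the model interfaces below are hypothetical placeholders), the composition is simply TTS applied to the S2TT output:

```python
# Toy sketch of two-pass S2ST: speech -> target text (S2TT), then
# target text -> target speech (TTS). These Protocols are hypothetical
# stand-ins, not ComSpeech or fairseq APIs.
from typing import Protocol

class S2TTModel(Protocol):
    def translate(self, source_audio: bytes) -> str: ...

class TTSModel(Protocol):
    def synthesize(self, text: str) -> bytes: ...

def two_pass_s2st(source_audio: bytes, s2tt: S2TTModel, tts: TTSModel) -> bytes:
    target_text = s2tt.translate(source_audio)   # pass 1: translate speech to text
    return tts.synthesize(target_text)           # pass 2: synthesize target speech
```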
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit [60.74922995613379]
fairseq S^2 is a fairseq extension for speech synthesis.
We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants.
To enable training speech synthesis models with less curated data, a number of preprocessing tools are built.
arXiv Detail & Related papers (2021-09-14T18:20:28Z)
- Scalable Multilingual Frontend for TTS [4.1203601403593275]
This paper describes progress towards making a Neural Text-to-Speech (TTS) Frontend that works for many languages and can be easily extended to new languages.
We take a Machine Translation inspired approach to constructing the frontend, and model both text normalization and pronunciation at the sentence level using sequence-to-sequence (S2S) models.
For our language-independent approach to pronunciation, we do not use a lexicon. Instead, all pronunciations, including context-based pronunciations, are captured in the S2S model.
arXiv Detail & Related papers (2020-04-10T08:00:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.