fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
- URL: http://arxiv.org/abs/2109.06912v1
- Date: Tue, 14 Sep 2021 18:20:28 GMT
- Title: fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
- Authors: Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino
- Abstract summary: fairseq S^2 is a fairseq extension for speech synthesis.
We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants.
To enable training speech synthesis models with less curated data, a number of preprocessing tools are built.
- Score: 60.74922995613379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents fairseq S^2, a fairseq extension for speech synthesis. We
implement a number of autoregressive (AR) and non-AR text-to-speech models, and
their multi-speaker variants. To enable training speech synthesis models with
less curated data, a number of preprocessing tools are built and their
importance is shown empirically. To facilitate faster iteration of development
and analysis, a suite of automatic metrics is included. Apart from the features
added specifically for this extension, fairseq S^2 also benefits from the
scalability offered by fairseq and can be easily integrated with other
state-of-the-art systems provided in this framework. The code, documentation,
and pre-trained models are available at
https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.
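As a quick illustration of the integration story, pre-trained fairseq S^2 checkpoints can be loaded and run through fairseq's TTS hub interface. The sketch below assumes the facebook/fastspeech2-en-ljspeech checkpoint released alongside the toolkit; the exact hub calls may vary between fairseq versions.

```python
# A minimal inference sketch; exact APIs may differ across fairseq versions.
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface

# Load a pre-trained FastSpeech 2 checkpoint together with its task and vocoder config.
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/fastspeech2-en-ljspeech",
    arg_overrides={"vocoder": "hifigan", "fp16": False},
)
model = models[0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator([model], cfg)

# Synthesize a waveform from raw text.
text = "Hello, this is a test run."
sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
# `wav` is a 1-D tensor of audio samples; `rate` is its sampling rate.
```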
Related papers
- Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [7.2129341612013285]
We introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures such as Gated Linear Attention (GLA).
This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes.
arXiv Detail & Related papers (2024-10-30T04:50:40Z)
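For orientation only: a gated-linear-attention layer keeps a recurrent key-value state that is decayed by a data-dependent gate at each step, rather than attending over the whole history. The toy recurrence below is a simplified sketch of that idea, not Lina-Speech's actual parameterization; all names and shapes are illustrative.

```python
import torch

def gla_step(state, q_t, k_t, v_t, g_t):
    """One step of a gated-linear-attention-style recurrence (toy sketch).

    state: (d_k, d_v) running key-value memory
    q_t, k_t, g_t: (d_k,) query, key, and gate (gate values in (0, 1))
    v_t: (d_v,) value
    """
    # Decay the memory with the gate, then write the new key-value pair.
    state = g_t.unsqueeze(-1) * state + torch.outer(k_t, v_t)
    # Read out with the query: constant work per step, no attention over history.
    o_t = state.T @ q_t
    return state, o_t

T, d_k, d_v = 16, 8, 8
state = torch.zeros(d_k, d_v)
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_v)
g = torch.sigmoid(torch.randn(T, d_k))  # learned gates in a real model
outputs = [None] * T
for t in range(T):
    state, outputs[t] = gla_step(state, q[t], k[t], v[t], g[t])
```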
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach built on low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
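The LoRA recipe referenced above is standard: the backbone weights are frozen and each adapted linear layer learns a low-rank update. A generic sketch follows, not the paper's exact configuration; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (generic LoRA sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the backbone stays frozen
        # y = W x + (alpha / r) * B A x; only A and B are trained.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
out = layer(torch.randn(4, 512))  # only ~2 * 8 * 512 extra trainable parameters
```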
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [39.892633589217326]
Large language models (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
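The residual vector quantizer mentioned above encodes a vector as a sum of codewords, one per stage, with each stage quantizing the residual left by the previous one. A self-contained toy sketch of that encoding step follows; the codebook sizes and dimensions are illustrative, and the real system learns the codebooks as part of a neural codec.

```python
import torch

def rvq_encode(x, codebooks):
    """Encode a vector with residual vector quantization (illustrative sketch).

    x: (d,) input vector; codebooks: list of (K, d) tensors, one per stage.
    """
    residual = x.clone()
    quantized = torch.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Pick the nearest codeword to what is still unexplained.
        idx = int(torch.cdist(residual.unsqueeze(0), cb).argmin())
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = residual - cb[idx]  # the next stage refines this remainder
    return codes, quantized

d, K, n_stages = 16, 256, 4
codebooks = [torch.randn(K, d) for _ in range(n_stages)]
codes, xq = rvq_encode(torch.randn(d), codebooks)
# More stages give a finer reconstruction; a neural codec learns the codebooks jointly.
```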
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
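Magnitude pruning of this kind can be reproduced in spirit with PyTorch's built-in pruning utilities; the snippet below zeroes low-magnitude weights in a toy layer. This is a generic sketch, not the paper's exact procedure or sparsity level.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a TTS submodule; the paper prunes full end-to-end models.
layer = nn.Linear(256, 256)

# Zero out the 70% of weights with the smallest L1 magnitude (unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.7)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~70%

# Bake the mask into the weight tensor once pruning is final.
prune.remove(layer, "weight")
```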
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct a fully neural-network-based TTS system.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
- Integrated Speech and Gesture Synthesis [26.267738299876314]
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model achieves this with faster synthesis time and a greatly reduced parameter count compared to the pipeline system.
arXiv Detail & Related papers (2021-08-25T19:04:00Z)
- fairseq S2T: Fast Speech-to-Text Modeling with fairseq [36.728277923926875]
fairseq S2T is a fairseq extension for speech-to-text (S2T) modeling tasks.
It provides end-to-end workflows from data pre-processing and model training to offline (and online) inference.
arXiv Detail & Related papers (2020-10-11T05:36:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.