Related papers: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

URL: http://arxiv.org/abs/2501.15907v1
Date: Mon, 27 Jan 2025 09:59:20 GMT
Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
Abstract summary: Emilia is the first multilingual speech generation dataset derived from in-the-wild speech data. We expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available.
Score: 26.569097905515033
Abstract: Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.

Related papers

Scaling Speech-Text Pre-training with Synthetic Interleaved Data [31.77653849518526]
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction. Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora.
arXiv Detail & Related papers (2024-11-26T17:19:09Z)
A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives [2.3592914313389257]
We are comparing monolingual Wav2Vec 2.0 models with various multilingual models to see whether we could improve speech recognition performance. Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models.
arXiv Detail & Related papers (2024-07-24T11:03:47Z)
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation [26.569097905515033]
Emilia is the first large-scale, multilingual, and diverse speech generation dataset. It starts with over 101k hours of speech across six languages, covering a wide range of speaking styles. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline.
arXiv Detail & Related papers (2024-07-07T13:24:54Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion. We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
Direct Punjabi to English speech translation using discrete units [4.883313216485195]
We present a direct speech-to-speech translation model for one of the Indic languages called Punjabi to English. We also explore the performance of using a discrete representation of speech called discrete acoustic units as input to the Transformer-based translation model. Our results show that the U2UT model performs better than the Speech-to-Unit Translation (S2UT) model by a 3.69 BLEU score.
arXiv Detail & Related papers (2024-02-25T03:03:34Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. A main ingredient is a new dataset based on readings of publicly available religious texts. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
Sentiment recognition of Italian elderly through domain adaptation on cross-corpus speech dataset [77.99182201815763]
The aim of this work is to define a speech emotion recognition (SER) model able to recognize positive, neutral and negative emotions in natural conversations of Italian elderly people.
arXiv Detail & Related papers (2022-11-14T12:39:41Z)
Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language. We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.