BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm
- URL: http://arxiv.org/abs/2301.04120v1
- Date: Sun, 11 Dec 2022 02:05:30 GMT
- Title: BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm
- Authors: Yu-Wen Chen, Hsin-Min Wang, Yu Tsao
- Abstract summary: The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation.
We propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences.
- Score: 29.701197643765674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of speech-processing models is heavily influenced by the
speech corpus that is used for training and evaluation. In this study, we
propose the BAlanced Script PROducer (BASPRO) system, which can automatically
construct a phonetically balanced and rich set of Chinese sentences for
collecting Mandarin Chinese speech data. First, we used pretrained natural
language processing systems to extract ten-character candidate sentences from a
large corpus of Chinese news texts. Then, we applied a genetic algorithm-based
method to select 20 phonetically balanced sentence sets, each containing 20
sentences, from the candidate sentences. Using BASPRO, we obtained a recording
script called TMNews, which contains 400 ten-character sentences. TMNews covers
84% of the syllables used in the real world. Moreover, the syllable
distribution has 0.96 cosine similarity to the real-world syllable
distribution. We converted the script into a speech corpus using two
text-to-speech systems. Using the designed speech corpus, we tested the
performance of speech enhancement (SE) and automatic speech recognition (ASR),
which are among the most important regression- and classification-based
speech-processing tasks, respectively. The experimental results show that the SE and
ASR models trained on the designed speech corpus outperform their counterparts
trained on a randomly composed speech corpus.
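The abstract describes the selection step but not its implementation. Purely as an illustration of how a genetic algorithm can evolve a sentence set toward a target syllable distribution, here is a minimal Python sketch; every name, operator, and hyperparameter below is hypothetical rather than taken from the paper, sentences are assumed to be pre-converted into lists of integer syllable IDs, and the fitness function is the cosine similarity to a real-world syllable distribution, i.e., the balance metric the abstract reports.

```python
import math
import random

def cosine_similarity(p, q):
    """Cosine similarity between two syllable-frequency vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def syllable_counts(sentence_ids, sentence_syllables, num_syllables):
    """Aggregate syllable counts over the chosen sentences."""
    counts = [0] * num_syllables
    for i in sentence_ids:
        for s in sentence_syllables[i]:
            counts[s] += 1
    return counts

def select_balanced_set(sentence_syllables, reference, set_size=20,
                        pop_size=100, generations=300, mutation_rate=0.2):
    """Evolve one set of `set_size` sentence indices whose syllable
    distribution is maximally cosine-similar to `reference`.
    Assumes the candidate pool is much larger than `set_size`."""
    n = len(sentence_syllables)

    def fitness(individual):
        counts = syllable_counts(individual, sentence_syllables, len(reference))
        return cosine_similarity(counts, reference)

    population = [random.sample(range(n), set_size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]        # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            merged = list(set(a) | set(b))           # crossover: sample from the union
            child = random.sample(merged, min(set_size, len(merged)))
            while len(child) < set_size:             # repair if the union was small
                extra = random.randrange(n)
                if extra not in child:
                    child.append(extra)
            if random.random() < mutation_rate:      # mutation: swap in one new sentence
                replacement = random.randrange(n)
                if replacement not in child:
                    child[random.randrange(set_size)] = replacement
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

To assemble a 400-sentence script like TMNews, one could run select_balanced_set 20 times, removing each winning set from the candidate pool before the next run; whether BASPRO selects the 20 sets jointly or sequentially is not stated in the abstract above.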
Related papers
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
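The summary above says the spectrogram and phonemes are randomly masked but gives no details. The sketch below shows a generic span-masking routine of the kind used in such joint speech-text pretraining; the span length, masking ratio, and mask values are assumptions for illustration, not ERNIE-SAT's actual settings.

```python
import random

def mask_random_spans(sequence, mask_value, mask_ratio=0.15, span_length=3):
    """Replace random contiguous spans with mask_value until roughly
    mask_ratio of the positions are masked. Works for phoneme-ID lists
    (mask_value = a mask token id) and for spectrograms stored as lists
    of frames (mask_value = a zero frame)."""
    seq = list(sequence)
    n = len(seq)
    # Never aim for more masked positions than are currently unmasked.
    target = min(int(n * mask_ratio), sum(1 for x in seq if x != mask_value))
    masked = 0
    while masked < target:
        start = random.randrange(n)
        for i in range(start, min(start + span_length, n)):
            if seq[i] != mask_value:
                seq[i] = mask_value
                masked += 1
    return seq
```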
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline [0.0]
This paper introduces a new Mandarin-English code-switching speech recognition corpus: the TALCS corpus.
The TALCS corpus is derived from real online one-to-one English teaching sessions at TAL Education Group.
To the best of our knowledge, TALCS is the largest well-labeled open-source Mandarin-English code-switching automatic speech recognition dataset in the world.
arXiv Detail & Related papers (2022-06-27T09:30:25Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
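The Wav2Seq summary compresses the idea heavily. A common way to induce such a discrete "pseudo language" is to cluster frame-level speech features and collapse repeated cluster IDs into compact token sequences; the sketch below shows that generic recipe using scikit-learn's KMeans, with all names and hyperparameters illustrative, since Wav2Seq's exact tokenization pipeline may differ.

```python
from sklearn.cluster import KMeans

def induce_pseudo_language(frame_features, vocab_size=500):
    """Cluster frame-level speech features into a discrete vocabulary.
    frame_features: (num_frames, dim) array, e.g. the outputs of a
    self-supervised speech encoder (an assumption for this sketch)."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(frame_features)

def pseudo_transcribe(kmeans, utterance_features):
    """Map one utterance, shaped (num_frames, dim), to a compact
    pseudo-token sequence by labeling each frame with its cluster
    and collapsing consecutive repeats."""
    frame_ids = kmeans.predict(utterance_features)
    tokens = []
    for t in frame_ids:
        if not tokens or t != tokens[-1]:  # run-length collapse
            tokens.append(int(t))
    return tokens
```

The resulting pseudo-token sequences can then serve as targets for a self-supervised "pseudo speech recognition" task, as the summary describes.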
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Construction of a Large-scale Japanese ASR Corpus on TV Recordings [2.28438857884398]
This paper presents a new large-scale Japanese speech corpus for training automatic speech recognition (ASR) systems.
This corpus contains over 2,000 hours of speech with transcripts built on Japanese TV recordings and their subtitles.
arXiv Detail & Related papers (2021-03-26T21:14:12Z)
- Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon [4.226093500082746]
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
For the speech corpus, we designed a sentence collection in which the di-phone distribution resembles that of real Central Kurdish data.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
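Jira's di-phone matching is the same corpus-design idea as BASPRO's syllable balancing above, applied at the di-phone level. As an illustrative sketch (not code from the Jira paper), the distribution being matched can be estimated like this:

```python
from collections import Counter

def diphone_distribution(phoneme_sequences):
    """Relative frequencies of adjacent phoneme pairs (di-phones)
    across a collection of phonemized sentences."""
    counts = Counter()
    for phones in phoneme_sequences:
        counts.update(zip(phones, phones[1:]))
    total = sum(counts.values())
    return {dp: c / total for dp, c in counts.items()} if total else {}
```

A candidate sentence collection can then be compared against the distribution estimated from a large reference text; the choice of similarity measure is up to the corpus designer.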
arXiv Detail & Related papers (2021-02-15T09:27:54Z)
- FT Speech: Danish Parliament Speech Corpus [21.190182627955817]
This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament.
The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers.
It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish.
arXiv Detail & Related papers (2020-05-25T19:51:18Z)