TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a
Speech Recognition Baseline
- URL: http://arxiv.org/abs/2206.13135v1
- Date: Mon, 27 Jun 2022 09:30:25 GMT
- Title: TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a
Speech Recognition Baseline
- Authors: Chengfei Li, Shuhao Deng, Yaoping Wang, Guangjing Wang, Yaguang Gong,
Changbin Chen and Jinfeng Bai
- Abstract summary: This paper introduces a new corpus for Mandarin-English code-switching speech recognition, the TALCS corpus.
The TALCS corpus is derived from real online one-to-one English teaching sessions at TAL Education Group.
To the best of our knowledge, the TALCS corpus is the largest well-labeled, open-source Mandarin-English code-switching automatic speech recognition dataset in the world.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new corpus for Mandarin-English code-switching speech
recognition, the TALCS corpus, suitable for training and evaluating code-switching
speech recognition systems. The TALCS corpus is derived from real online one-to-one
English teaching sessions at TAL Education Group and contains roughly 587
hours of speech sampled at 16 kHz. To the best of our knowledge, the TALCS corpus is the
largest well-labeled, open-source Mandarin-English code-switching automatic
speech recognition (ASR) dataset in the world. We describe the recording
procedure in detail, including the audio capturing devices and recording
environments. The TALCS corpus is freely available for download under a
permissive license. Using the TALCS corpus, we build baseline systems with two
popular speech recognition toolkits, ESPnet and WeNet, and compare their
Mixture Error Rate (MER) performance on the corpus. The experimental results
imply that the quality of the audio recordings and transcriptions is promising
and that the baseline systems are workable.
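For context, MER extends word error rate to mixed-language text: the reference and hypothesis are tokenized into Mandarin characters plus English words, and the token-level edit distance is divided by the reference length. A minimal sketch of the metric (the tokenization regex and the example strings are illustrative, not taken from the paper):

```python
import re

def tokenize_cs(text: str):
    """Split a code-switched string into Mandarin characters and English words."""
    # Each CJK character becomes one token; each Latin-letter run becomes one word token.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref, hyp):
    """Standard Levenshtein distance over token sequences (rolling one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def mer(ref: str, hyp: str) -> float:
    """Mixture Error Rate: mixed-token edit distance / number of reference tokens."""
    ref_toks, hyp_toks = tokenize_cs(ref), tokenize_cs(hyp)
    return edit_distance(ref_toks, hyp_toks) / len(ref_toks)

# Hypothetical example: one substituted Mandarin character among six
# reference tokens -> MER of about 16.7%.
print(mer("今天 我们 learn English", "今天 你们 learn English"))
```

Treating each Mandarin character as a single token while keeping English words whole is what distinguishes MER from plain CER or WER on code-switched text.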
Related papers
- Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages [49.6922490267701]
We introduce a new zero resource code-switched speech benchmark designed to assess the code-switching capabilities of self-supervised speech encoders.
We showcase a baseline system based on language modeling over discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed.
arXiv Detail & Related papers (2023-10-04T17:58:11Z)
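The benchmark above scores paired utterances with a language model trained on the encoder's discrete units; an encoder that captures code-switching should assign the genuine utterance a higher probability than a mismatched one. A toy illustration of that scoring idea, using an add-one-smoothed bigram model over made-up unit IDs (all data here is hypothetical):

```python
from collections import defaultdict
import math

class BigramUnitLM:
    """Tiny add-one-smoothed bigram LM over discrete speech units."""
    def __init__(self, sequences, vocab_size):
        self.vocab = vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.counts[a][b] += 1

    def logprob(self, seq):
        lp = 0.0
        for a, b in zip(seq, seq[1:]):
            total = sum(self.counts[a].values())
            lp += math.log((self.counts[a][b] + 1) / (total + self.vocab))
        return lp

# Hypothetical discrete units from a speech encoder (e.g., k-means cluster IDs).
train = [[1, 2, 3, 4, 2, 3], [4, 2, 3, 1, 1, 2]]
lm = BigramUnitLM(train, vocab_size=5)
real, mismatched = [1, 2, 3, 4], [3, 1, 4, 4]
print(lm.logprob(real) > lm.logprob(mismatched))  # True: seen patterns score higher
```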
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
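The splicing step in Speech Collage can be pictured as concatenating monolingual audio clips that realize a code-switched text, with some smoothing at the joins. A rough sketch under that reading (the crossfade length and the per-word clip lookup are my assumptions, not details from the paper):

```python
import numpy as np

def splice_segments(segments, sr=16000, xfade_ms=10):
    """Concatenate audio segments with a short linear crossfade at each join,
    a rough stand-in for the smoothing such splicing needs."""
    n = int(sr * xfade_ms / 1000)
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        fade = np.linspace(0.0, 1.0, n)
        out[-n:] = out[-n:] * (1.0 - fade) + seg[:n] * fade  # overlap-add join
        out = np.concatenate([out, seg[n:]])
    return out

# Hypothetical: clips looked up from Mandarin and English corpora by word.
zh = np.random.randn(8000)   # placeholder for a Mandarin word clip
en = np.random.randn(12000)  # placeholder for an English word clip
cs_audio = splice_segments([zh, en])
```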
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
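The CTAP summary does not spell out its objective, but contrastive pretraining of paired encoders typically uses a symmetric InfoNCE loss over matched (speech, phoneme) embeddings; a numpy sketch under that assumption:

```python
import numpy as np

def info_nce(speech_emb, phone_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (speech, phoneme) embeddings:
    matching pairs sit on the diagonal of the similarity matrix."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    p = phone_emb / np.linalg.norm(phone_emb, axis=1, keepdims=True)
    logits = s @ p.T / temperature  # cosine similarities, scaled
    idx = np.arange(len(logits))
    log_sm_s2p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_p2s = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    # Pull each speech vector toward its own phoneme vector, and vice versa.
    return -(log_sm_s2p[idx, idx].mean() + log_sm_p2s[idx, idx].mean()) / 2

# Hypothetical batch of 4 paired 256-dim embeddings.
loss = info_nce(np.random.randn(4, 256), np.random.randn(4, 256))
```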
- ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker, sampled at 40100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z)
- BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm [29.701197643765674]
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation.
We propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences.
arXiv Detail & Related papers (2022-12-11T02:05:30Z)
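BASPRO's genetic search can be sketched as evolving candidate sentence subsets toward a target phone distribution. The toy below keeps only selection and mutation (no crossover), and the fitness function, population sizes, and phone inventory are illustrative assumptions rather than the paper's actual design:

```python
import random
from collections import Counter

def phone_dist(sentences):
    """Relative frequency of each phone across a set of sentences."""
    c = Counter(p for s in sentences for p in s)
    total = sum(c.values())
    return {p: n / total for p, n in c.items()}

def fitness(subset, target):
    """Closer selected-set phone distribution to the target -> higher fitness."""
    d = phone_dist(subset)
    return -sum(abs(d.get(p, 0.0) - q) for p, q in target.items())

def baspro_like(pool, target, k=10, pop=30, gens=200):
    """Toy GA: individuals are k-sentence subsets; mutation swaps one sentence."""
    population = [random.sample(pool, k) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda ind: fitness(ind, target), reverse=True)
        survivors = population[: pop // 2]
        children = []
        for ind in survivors:
            child = ind.copy()
            child[random.randrange(k)] = random.choice(pool)  # mutate one slot
            children.append(child)
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, target))

# Hypothetical sentences as phone-ID tuples; target = uniform over 8 phones.
pool = [tuple(random.choices(range(8), k=12)) for _ in range(200)]
best = baspro_like(pool, {p: 1 / 8 for p in range(8)})
```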
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
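The augmentation described above amounts to joining monolingual utterances, audio and transcript alike, so the model sees synthetic language switches. A minimal sketch (the field names, silence gap, and language pair are my assumptions):

```python
import numpy as np

def concat_augment(utt_a, utt_b, sr=16000, gap_ms=50):
    """Concatenative CS augmentation: join two monolingual utterances
    (audio and transcript) with a short silence in between."""
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    audio = np.concatenate([utt_a["audio"], gap, utt_b["audio"]])
    text = f'{utt_a["text"]} {utt_b["text"]}'
    return {"audio": audio, "text": text}

# Hypothetical utterances drawn from two monolingual training sets.
de = {"audio": np.random.randn(16000).astype(np.float32), "text": "guten Morgen"}
en = {"audio": np.random.randn(16000).astype(np.float32), "text": "good morning"}
mixed = concat_augment(de, en)  # yields an inter-sentential switch example
```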
- Mandarin-English Code-switching Speech Recognition with Self-supervised Speech Representation Models [55.82292352607321]
Code-switching (CS) is common in daily conversations where more than one language is used within a sentence.
This paper uses recently successful self-supervised learning (SSL) methods to leverage large amounts of unlabeled speech data without CS.
arXiv Detail & Related papers (2021-10-07T14:43:35Z)
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition [25.31180901037065]
WenetSpeech is a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech.
We collect the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions.
arXiv Detail & Related papers (2021-10-07T12:05:29Z)
- Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon [4.226093500082746]
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
For the speech corpus, we designed a sentence collection in which the ratio of di-phones resembles that of real Central Kurdish data.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
arXiv Detail & Related papers (2021-02-15T09:27:54Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition [1.7955614278088239]
KoSpeech is an end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch.
We propose preprocessing methods for KsponSpeech corpus and a baseline model for benchmarks.
Our baseline model achieved a 10.31% character error rate (CER) on the KsponSpeech corpus with the acoustic model alone.
arXiv Detail & Related papers (2020-09-07T13:25:36Z)