QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic
Speech Corpus
- URL: http://arxiv.org/abs/2106.13000v1
- Date: Thu, 24 Jun 2021 13:20:40 GMT
- Title: QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic
Speech Corpus
- Authors: Hamdy Mubarak, Amir Hussein, Shammur Absar Chowdhury, Ahmed Ali
- Abstract summary: We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel.
- Score: 11.113497373432411
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce the largest transcribed Arabic speech corpus, QASR, collected
from the broadcast domain. This multi-dialect speech dataset contains 2,000
hours of speech sampled at 16kHz crawled from Aljazeera news channel. The
dataset is released with lightly supervised transcriptions, aligned with the
audio segments. Unlike previous datasets, QASR contains linguistically
motivated segmentation, punctuation, speaker information among others. QASR is
suitable for training and evaluating speech recognition systems, acoustics-
and/or linguistics- based Arabic dialect identification, punctuation
restoration, speaker identification, speaker linking, and potentially other NLP
modules for spoken data. In addition to QASR transcription, we release a
dataset of 130M words to aid in designing and training a better language model.
We show that end-to-end automatic speech recognition trained on QASR reports a
competitive word error rate compared to the previous MGB-2 corpus. We report
baseline results for downstream natural language processing tasks such as named
entity recognition using speech transcripts. We also report the first baseline
for Arabic punctuation restoration. We make the corpus available for the
research community.
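The headline result above is a competitive word error rate (WER) against the earlier MGB-2 corpus. For readers unfamiliar with the metric, the short Python sketch below shows how WER is conventionally computed as a word-level edit distance between a reference transcript and an ASR hypothesis; the function name and example strings are illustrative only and are not part of the QASR release or its evaluation scripts.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Toy example with made-up strings (not QASR data):
print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")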
Related papers
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition [25.31180901037065]
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech.
We collect the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions.
arXiv Detail & Related papers (2021-10-07T12:05:29Z)
- GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio [88.20960848885575]
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h.
arXiv Detail & Related papers (2021-06-13T04:09:16Z)
- Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon [4.226093500082746]
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
For the speech corpus, we designed a sentence collection in which the distribution of diphones resembles that of real Central Kurdish data.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
arXiv Detail & Related papers (2021-02-15T09:27:54Z)
- A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline [4.521450956414864]
The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups.
The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications.
arXiv Detail & Related papers (2020-09-22T05:57:15Z)