VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System
- URL: http://arxiv.org/abs/2310.11069v4
- Date: Fri, 27 Oct 2023 13:32:15 GMT
- Title: VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System
- Authors: Abdul Waheed, Bashar Talafha, Peter Sullivan, AbdelRahim Elmadany,
Muhammad Abdul-Mageed
- Abstract summary: VoxArabica is a system for dialect identification (DID) and automatic speech recognition (ASR) of Arabic.
We train a wide range of models such as HuBERT (DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR tasks.
We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.
We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs.
- Score: 16.420831300734697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Arabic is a complex language with many varieties and dialects spoken by over
450 million people around the world. Due to this linguistic diversity and
variation, it is challenging to build a robust and generalized ASR system for
Arabic. In this work, we address this gap by developing and demoing a system,
dubbed VoxArabica, for dialect identification (DID) as well as automatic speech
recognition (ASR) of Arabic. We train a wide range of models such as HuBERT
(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR
tasks. Our DID models are trained to identify 17 different dialects in addition
to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.
Additionally, for the remaining dialects in ASR, we provide the option to
choose various models such as Whisper and MMS in a zero-shot setting. We
integrate these models into a single web interface with diverse features such
as audio recording, file upload, model selection, and the option to raise flags
for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide
range of audiences concerned with Arabic research. Our system is currently
running at https://cdce-206-12-100-168.ngrok.io/.
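The abstract describes a routing design: a DID model first identifies the dialect, a matching finetuned ASR model produces the transcript, and zero-shot Whisper or MMS covers the remaining dialects. Below is a minimal sketch of that flow using Hugging Face pipelines; all model identifiers are placeholders of mine, since the paper does not list released checkpoint names.

```python
from transformers import pipeline

# Placeholder model IDs (assumptions, not checkpoints released by the paper).
DID_MODEL = "example-org/hubert-arabic-did"    # HuBERT classifier: MSA + 17 dialects
FINETUNED_ASR = {                              # Whisper finetunes named in the abstract
    "MSA": "example-org/whisper-arabic-msa",
    "Egyptian": "example-org/whisper-arabic-egy",
    "Moroccan": "example-org/whisper-arabic-mor",
}
ZERO_SHOT_ASR = "openai/whisper-large-v2"      # fallback for the remaining dialects

did = pipeline("audio-classification", model=DID_MODEL)

def transcribe(audio_path: str) -> dict:
    """Run DID first, then route the audio to the matching ASR model."""
    dialect = did(audio_path)[0]["label"]      # top-1 dialect prediction
    asr_id = FINETUNED_ASR.get(dialect, ZERO_SHOT_ASR)
    asr = pipeline("automatic-speech-recognition", model=asr_id)
    return {"dialect": dialect, "text": asr(audio_path)["text"]}

print(transcribe("recording.wav"))
```

The web features listed above (recording, upload, model selection, flagging) could then be a thin UI layer, e.g. a small Gradio app, wrapping `transcribe`.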
Related papers
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; Arabic, however, still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition [8.731646409966737]
It is not clear how Whisper would fare under diverse conditions, even on languages it was evaluated on, such as Arabic.
Our evaluation covers most publicly available Arabic speech data and is performed under n-shot finetuning.
We also investigate the robustness of Whisper under completely novel conditions, such as in dialect-accented standard Arabic and in unseen dialects.
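As a rough illustration of the zero-shot end of such an n-shot evaluation, the sketch below transcribes one Arabic clip with the openai-whisper package and scores it with jiwer; the file name and reference transcript are placeholders.

```python
import whisper         # pip install openai-whisper
from jiwer import wer  # pip install jiwer

model = whisper.load_model("large-v2")
result = model.transcribe("arabic_clip.wav", language="ar")  # zero-shot: no finetuning
hypothesis = result["text"]

reference = "..."      # gold transcript for the clip (placeholder)
print("hypothesis:", hypothesis)
print(f"WER: {wer(reference, hypothesis):.2%}")
```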
arXiv Detail & Related papers (2023-06-05T14:09:25Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
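"Random-projection quantization" here is a BEST-RQ-style target generator: each frame is projected by a frozen random matrix and matched against a frozen random codebook, and the resulting indices become labels for masked prediction. A minimal NumPy sketch with illustrative sizes (not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
MEL_DIM, PROJ_DIM, CODEBOOK_SIZE = 80, 16, 4096  # illustrative sizes

# Frozen (never trained) random projection and codebook, as in BEST-RQ.
P = rng.normal(size=(MEL_DIM, PROJ_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, MEL_DIM) feature frames to nearest-codebook indices (T,)."""
    z = frames @ P
    z /= np.linalg.norm(z, axis=1, keepdims=True)      # cosine similarity
    return np.argmax(z @ codebook.T, axis=1)           # discrete training targets

targets = quantize(rng.normal(size=(100, MEL_DIM)))    # stand-in for log-mel frames
```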
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- SD-QA: Spoken Dialectal Question Answering for the Real World [15.401330338654203]
We build a benchmark on five languages (Arabic, Bengali, English, Kiswahili, Korean) with more than 68k audio prompts in 24 dialects from 255 speakers.
We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance.
Last, we study the fairness of the ASR and QA models with respect to the underlying user populations.
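One simple form this fairness analysis can take is grouping WER by dialect. A toy sketch with jiwer; the triples below are invented stand-ins, not SD-QA's actual schema or data.

```python
from collections import defaultdict
from jiwer import wer  # pip install jiwer

# (dialect, reference transcript, ASR hypothesis) -- invented examples
samples = [
    ("Egyptian", "where is the station", "where is station"),
    ("Gulf", "how much does it cost", "how much does it cost"),
    ("Egyptian", "i need a doctor", "i need doctor"),
]

grouped = defaultdict(lambda: ([], []))
for dialect, ref, hyp in samples:
    grouped[dialect][0].append(ref)
    grouped[dialect][1].append(hyp)

for dialect, (refs, hyps) in grouped.items():
    print(f"{dialect}: WER = {wer(refs, hyps):.2%}")
```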
arXiv Detail & Related papers (2021-09-24T16:54:27Z)
- Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR [11.363966269198064]
We design a large multilingual end-to-end ASR system using a self-attention-based conformer architecture.
We trained the system using Arabic (Ar), English (En) and French (Fr) languages.
Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
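A conformer encoder of this kind can be instantiated directly from torchaudio; the hyperparameters below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torchaudio

# Illustrative sizes only (assumptions, not the paper's setup).
encoder = torchaudio.models.Conformer(
    input_dim=80,                      # log-mel feature dimension
    num_heads=4,
    ffn_dim=256,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)
feats = torch.randn(2, 200, 80)        # (batch, frames, mel bins)
lengths = torch.tensor([200, 150])     # valid frames per utterance
out, out_lengths = encoder(feats, lengths)  # out: (batch, frames, input_dim)
```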
arXiv Detail & Related papers (2021-05-31T08:20:38Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a phonotactic model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)