ArTST: Arabic Text and Speech Transformer
- URL: http://arxiv.org/abs/2310.16621v1
- Date: Wed, 25 Oct 2023 13:20:54 GMT
- Title: ArTST: Arabic Text and Speech Transformer
- Authors: Hawau Olamide Toyin, Amirbek Djanibekov, Ajinkya Kulkarni, Hanan
Aldarmaki
- Abstract summary: We present ArTST, a pre-trained Arabic text and speech transformer.
It supports open-source speech technologies for the Arabic language.
- Score: 2.53638770809417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ArTST, a pre-trained Arabic text and speech transformer for
supporting open-source speech technologies for the Arabic language. The model
architecture follows the unified-modal framework SpeechT5, which was recently
released for English. The current version focuses on Modern Standard Arabic (MSA), with
plans to extend the model to dialectal and code-switched Arabic in future
editions. We pre-trained the model from scratch on MSA speech and text data,
and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR),
Text-To-Speech synthesis (TTS), and spoken dialect identification. In our
experiments comparing ArTST with SpeechT5, as well as with previously reported
results on these tasks, ArTST performs on a par with or exceeds the current
state-of-the-art in all three tasks. Moreover, we find that our pre-training is
conducive to generalization, which is particularly evident in the low-resource
TTS task. The pre-trained model as well as the fine-tuned ASR and TTS models
are released for research use.
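Since the architecture follows SpeechT5 and the fine-tuned ASR model is released for research use, inference should look much like standard SpeechT5 usage. Below is a minimal sketch using the Hugging Face transformers SpeechT5 classes; the checkpoint identifier is a placeholder, not a confirmed ArTST model ID.
```python
# Minimal ASR inference sketch for a SpeechT5-style checkpoint such as the
# released ArTST ASR model. The checkpoint name below is a placeholder, not a
# confirmed model ID; substitute the identifier from the official release.
import torch
import torchaudio
from transformers import SpeechT5ForSpeechToText, SpeechT5Processor

CHECKPOINT = "<artst-asr-checkpoint>"  # placeholder

processor = SpeechT5Processor.from_pretrained(CHECKPOINT)
model = SpeechT5ForSpeechToText.from_pretrained(CHECKPOINT)

# SpeechT5 expects 16 kHz mono audio; resample if the source rate differs.
waveform, sr = torchaudio.load("arabic_utterance.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(audio=waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```
The released TTS model should pair analogously with SpeechT5ForTextToSpeech plus a vocoder such as SpeechT5HifiGan in the same library.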
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of speech translation (ST) of code-switched speech in Indian languages to English text.
We present CoSTA, a new end-to-end model architecture that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data, and then investigate injecting the generated text into the T-T model explicitly via Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on a T-T model trained on a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker, sampled at 40,100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z) - End-to-End Speech Translation of Arabic to English Broadcast News [2.375764121997739]
Speech translation (ST) is the task of translating acoustic speech signals in a source language into text in a foreign language.
This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system.
arXiv Detail & Related papers (2022-12-11T11:35:46Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given only a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
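To illustrate the bridging idea in code: transcribe the speech, then hand the transcript to a text pre-trained model. The sketch below only shows the shape of the pipeline; it substitutes an off-the-shelf supervised wav2vec2 ASR checkpoint and an extractive QA model, whereas the paper connects the models with unsupervised ASR.
```python
# Bridge sketch: speech -> ASR transcript -> text pre-trained model.
# The paper uses *unsupervised* ASR; a supervised wav2vec2 checkpoint stands
# in here purely to show how the two stages compose.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

transcript = asr("spoken_passage.wav")["text"]  # any 16 kHz English recording
result = qa(question="What is the passage about?", context=transcript)
print(result["answer"], result["score"])
```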
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how pre-trained language models (PLMs) with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
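A small sketch of what phonemicizing text before tokenization could look like, assuming the open-source phonemizer package with an espeak backend; the exact preprocessing used for T5lephone is not detailed in this summary.
```python
# Convert raw text to phonemes before feeding a T5-style tokenizer, the kind
# of preprocessing implied by "pretrained using phonemicized text".
# Assumes the `phonemizer` package with an installed espeak-ng backend.
from phonemizer import phonemize
from transformers import AutoTokenizer

text = "bridging speech and text models"
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # an IPA-like phoneme string

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # stand-in tokenizer
print(tokenizer(phonemes).input_ids)
```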
arXiv Detail & Related papers (2022-11-01T17:00:23Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks [20.83731188652985]
We leverage the INA (French National Audiovisual Institute) collection and obtain 19 GB of text by applying ASR to 350,000 hours of diverse TV shows.
New models (FlauBERT-Oral) are shared with the community and evaluated on three downstream tasks: spoken language understanding, classification of TV shows, and syntactic parsing of speech.
arXiv Detail & Related papers (2022-07-05T08:47:51Z)
- AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation [6.021269454707625]
We introduce a new benchmark for Arabic language generation (ARGEN).
We pre-train three powerful Arabic-specific text-to-text Transformer-based models and evaluate them on both Arabic language understanding and generation benchmarks.
Our new models perform significantly better than mT5 and exceed MARBERT, the current state-of-the-art Arabic BERT-based model, on Arabic language understanding.
arXiv Detail & Related papers (2021-08-31T02:02:10Z)
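As a rough illustration of how such text-to-text Arabic models are typically driven, here is a minimal generation sketch with the transformers seq2seq API; the checkpoint identifier and the task prefix are illustrative placeholders, not confirmed details of the AraT5 release.
```python
# Minimal seq2seq generation sketch for a T5-style Arabic model such as AraT5.
# The checkpoint ID and task prefix are illustrative placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "<arat5-checkpoint>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

# Example: Arabic news-title generation framed as text-to-text.
source = "عنوان: " + "نص خبري قصير عن الطقس في أبوظبي"
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```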
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all of its information) and is not responsible for any consequences of its use.