Leveraging Data Collection and Unsupervised Learning for Code-switched
Tunisian Arabic Automatic Speech Recognition
- URL: http://arxiv.org/abs/2309.11327v2
- Date: Mon, 25 Sep 2023 11:20:36 GMT
- Title: Leveraging Data Collection and Unsupervised Learning for Code-switched
Tunisian Arabic Automatic Speech Recognition
- Authors: Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah
Zaiem
- Abstract summary: This paper addresses the Automatic Speech Recognition (ASR) challenge, focusing on the Tunisian dialect.
First, textual and audio data are collected and, in some cases, annotated.
Second, we explore self-supervision, semi-supervision and few-shot code-switching approaches to push the state-of-the-art on different Tunisian test sets.
Third, given the absence of conventional spelling, we produce a human evaluation of our transcripts to avoid the noise coming from spelling inadequacies in our testing references.
- Score: 4.67385883375784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Crafting an effective Automatic Speech Recognition (ASR) solution for
dialects demands innovative approaches that not only address the data scarcity
issue but also navigate the intricacies of linguistic diversity. In this paper,
we address the aforementioned ASR challenge, focusing on the Tunisian dialect.
First, textual and audio data are collected and, in some cases, annotated. Second,
we explore self-supervision, semi-supervision and few-shot code-switching
approaches to push the state-of-the-art on different Tunisian test sets;
covering different acoustic, linguistic and prosodic conditions. Finally, and
given the absence of conventional spelling, we produce a human evaluation of
our transcripts to avoid the noise coming from spelling inadequacies in our
testing references. Our models, which can transcribe audio samples in a
linguistic mix of Tunisian Arabic, English, and French, are released for public
use, together with all the data used during training and testing, to enable
further improvements.
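
One common way to realize the self-supervised route described in the abstract is to fine-tune a multilingual self-supervised encoder such as wav2vec 2.0 XLSR with a fresh CTC head over a character vocabulary covering the code-switched mix. The sketch below uses the Hugging Face transformers API; the checkpoint, character inventory, and settings are illustrative assumptions, not the authors' exact recipe.

```python
import json
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

# Hypothetical character inventory spanning the Tunisian Arabic / French /
# English mix; a real vocabulary would be extracted from the collected text.
chars = list("abcdefghijklmnopqrstuvwxyzàâçéèêîïôùû'") + \
        list("ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي")
vocab = {c: i for i, c in enumerate(sorted(set(chars)))}
vocab["|"] = len(vocab)        # word delimiter
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)    # also serves as the CTC blank
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# Load the multilingual self-supervised encoder; the CTC head on top is
# randomly initialized and trained on the labeled Tunisian data.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common practice when labeled data is scarce
```

From here, training proceeds as standard CTC fine-tuning on (audio, transcript) pairs; semi-supervised variants would add pseudo-labeled audio to that pool.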
Related papers
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique for learning audio-language representations that are robust to linguistic variations.
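
For context, CLAP-style audio-text retrieval models are typically trained with a symmetric contrastive objective like the sketch below. RobustCLAP's specific mechanism for handling linguistic variations (its stated contribution) is not reproduced here, so treat this as background rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP/CLAP-style loss over a batch of paired (audio, caption) embeddings."""
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim)
    t = F.normalize(text_emb, dim=-1)    # (batch, dim)
    logits = a @ t.T / temperature       # pairwise similarity matrix
    targets = torch.arange(a.size(0))    # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```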
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text [22.19230427358921]
It is worth investigating how to improve Whisper's performance on under-represented languages.
We utilized easily accessible unpaired speech and text data and combined the GPT language model with Whisper for Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
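
One simple way to combine an external LM with an ASR system, consistent with the summary above though not necessarily the paper's exact fusion method, is n-best rescoring: add a weighted LM log-probability to each hypothesis's ASR score. The `gpt2` checkpoint below is an English stand-in for a Kazakh LM.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")   # stand-in for a Kazakh LM
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_logprob(text: str) -> float:
    """Total log-probability of `text` under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss           # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

def rescore(nbest, lam=0.3):
    """nbest: list of (hypothesis, asr_logprob) pairs from the ASR decoder."""
    return max(nbest, key=lambda h: h[1] + lam * lm_logprob(h[0]))

# Toy hypotheses and scores, for illustration only.
best_hyp, _ = rescore([("menin atym aidar", -4.1), ("menin atim aidar", -3.9)])
```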
arXiv Detail & Related papers (2024-08-10T13:39:13Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
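
To make the sequence-tagging formulation concrete: each token receives a binary label (drop if disfluent, keep otherwise), and correction is just filtering. The tags below are hand-written for illustration; in the paper they would come from the adversarially trained tagger.

```python
def apply_tags(tokens, tags):
    """Keep tokens labeled 0 (fluent); drop tokens labeled 1 (disfluent)."""
    return [tok for tok, tag in zip(tokens, tags) if tag == 0]

# Filler "uh" and the repeated "i" / "want" are tagged as disfluent.
tokens = ["i", "uh", "i", "want", "want", "to", "go"]
tags   = [ 1,   1,    0,   1,      0,      0,    0 ]
print(" ".join(apply_tags(tokens, tags)))  # -> i want to go
```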
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
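
As background on the separation step itself (not the paper's trimodal losses), conditional source separation is often cast as predicting a time-frequency mask for the target source given a query embedding; a minimal sketch, with all dimensions hypothetical:

```python
import torch
import torch.nn as nn

class ConditionalMaskSeparator(nn.Module):
    """Predict a mask over the mixture spectrogram given a conditioning
    embedding (here imagined to come from a vision-language model)."""
    def __init__(self, n_freq: int = 257, cond_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, cond):
        # mix_mag: (batch, time, n_freq) mixture magnitudes; cond: (batch, cond_dim)
        cond = cond.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        mask = self.net(torch.cat([mix_mag, cond], dim=-1))
        return mask * mix_mag  # estimated target-source magnitudes
```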
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and asks annotators to re-enter the original test data using two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
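
The definition above reduces to a simple ratio. A minimal sketch, assuming a hand-built marker lexicon (the paper predicts density automatically, without such a lexicon):

```python
# Illustrative, hypothetical lexicon of non-standard-dialect markers.
DIALECT_MARKERS = {"finna", "ain't", "gon"}

def dialect_density(utterance: str) -> float:
    """Fraction of words carrying characteristics of the non-standard dialect."""
    words = utterance.lower().split()
    if not words:
        return 0.0
    return sum(w in DIALECT_MARKERS for w in words) / len(words)

print(dialect_density("he finna go home"))  # 1 marker / 4 words = 0.25
```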
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition [3.2631198264090746]
Aphasia is a common speech and language disorder, typically caused by a brain injury or a stroke, that affects millions of people worldwide.
We propose an end-to-end pipeline using pre-trained Automatic Speech Recognition (ASR) models that share cross-lingual speech representations.
arXiv Detail & Related papers (2022-04-01T14:05:02Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
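
A minimal sketch of a generative LSTM LM over discrete linguistic units (phoneme or syllable IDs), in the spirit of the summary above; the inventory size and layer dimensions are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitLM(nn.Module):
    """Next-unit prediction over phoneme/syllable IDs."""
    def __init__(self, n_units: int, emb: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_units)

    def forward(self, x):                  # x: (batch, time) unit IDs
        h, _ = self.lstm(self.embed(x))
        return self.head(h)                # (batch, time, n_units) logits

n_units = 50                               # hypothetical phoneme inventory size
model = UnitLM(n_units)
x = torch.randint(0, n_units, (4, 32))     # a toy batch of unit sequences
logits = model(x)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, n_units),  # predict unit t+1
                       x[:, 1:].reshape(-1))
```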
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- TEET! Tunisian Dataset for Toxic Speech Detection [0.0]
The Tunisian dialect blends elements of several languages, including MSA, Tamazight, Italian, and French.
Given this linguistic richness and the lack of large annotated datasets, NLP for the dialect is challenging.
This paper introduces a new annotated dataset of approximately 10k comments.
arXiv Detail & Related papers (2021-10-11T14:00:08Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)