Multi-Dialect Arabic Speech Recognition
- URL: http://arxiv.org/abs/2112.14678v1
- Date: Sat, 25 Dec 2021 20:55:57 GMT
- Title: Multi-Dialect Arabic Speech Recognition
- Authors: Abbas Raza Ali
- Abstract summary: This paper presents the design and development of multi-dialect automatic speech recognition for Arabic.
Deep neural networks are becoming an effective tool to solve sequential data problems.
The proposed system achieved a 14% error rate, which outperforms previous systems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the design and development of a multi-dialect
automatic speech recognition system for Arabic. Deep neural networks are
becoming an effective tool for solving sequential data problems, particularly
when the system is trained end-to-end. Arabic speech recognition is a complex
task because of the existence of multiple dialects, the non-availability of
large corpora, and missing vocalization. Thus, the first contribution of this
work is the development of a large multi-dialectal corpus with fully or at
least partially vocalized transcription. Additionally, the open-source corpus
has been gathered from multiple sources that introduce non-standard Arabic
characters into the transcription, which are normalized by defining a common
character set (see the normalization sketch below). The second contribution is
the development of a framework to train an acoustic model that achieves
state-of-the-art performance. The network architecture comprises a combination
of convolutional and recurrent layers (see the model sketch below).
Spectrogram features of the audio data are extracted in the
frequency-versus-time domain and fed into the network. The output frames
produced by the recurrent model are further trained to align the audio
features with their corresponding transcription sequences. The sequence
alignment is performed using a beam search decoder with a 4-gram language
model (see the decoding sketch below). The proposed system achieved a 14%
error rate, which outperforms previous systems.
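The abstract does not enumerate the common character set, so the following is
a minimal sketch, in Python, of what such a normalization pass could look
like. The specific mappings (alef wasla, Farsi yeh, keheh, tatweel) are
illustrative assumptions rather than the paper's actual table, and diacritics
are deliberately preserved since the corpus keeps vocalization.

```python
# Minimal sketch of character-set normalization for Arabic transcripts.
# The mappings below are illustrative assumptions, not the paper's table.
CHAR_MAP = {
    "\u0671": "\u0627",  # alef wasla -> bare alef
    "\u06CC": "\u064A",  # Farsi yeh  -> Arabic yeh
    "\u06A9": "\u0643",  # keheh      -> kaf
    "\u0640": "",        # tatweel (kashida) -> removed
}

def normalize(text: str) -> str:
    """Fold non-standard letter variants into the common character set,
    leaving vocalization (diacritic) marks untouched."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)
```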
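For the acoustic model, the abstract specifies only a combination of
convolutional and recurrent layers over spectrogram input. The sketch below
(PyTorch) fixes the layer counts, GRU sizes, and a CTC-style output head as
assumptions to make the idea concrete; none of these sizes come from the
paper.

```python
import torch
import torch.nn as nn

class ConvRNNAcousticModel(nn.Module):
    """Hypothetical CNN+RNN acoustic model: 2-D convolutions over the
    (frequency x time) spectrogram followed by bidirectional recurrent
    layers; all layer sizes here are assumptions."""

    def __init__(self, n_mels: int = 80, vocab_size: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), 256, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 256, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time)
        x = self.conv(spec)                  # (batch, 32, n_mels // 4, time)
        x = x.flatten(1, 2).transpose(1, 2)  # (batch, time, features)
        x, _ = self.rnn(x)
        return self.out(x).log_softmax(-1)   # per-frame label log-probs
```

The abstract's description of training the recurrent output frames to align
with the transcription sequences matches CTC-style training, so the per-frame
log-probabilities above could be fed to torch.nn.CTCLoss; the exact alignment
criterion used in the paper is an assumption here.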
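Finally, the beam search with a 4-gram language model could be realized with
an off-the-shelf CTC decoder. The pairing of pyctcdecode with a KenLM ARPA
file below is one common open-source option and an assumption here, as are the
model file name, the decoder weights, and the (truncated) label set.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Labels must match the acoustic model's output columns; the Arabic
# character set is truncated here for illustration ("" is the CTC blank).
labels = ["", " ", "\u0627", "\u0628", "\u062A"]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="arabic_4gram.arpa",  # hypothetical KenLM 4-gram model
    alpha=0.5,  # language-model weight (assumed value)
    beta=1.0,   # word-insertion bonus (assumed value)
)

# log_probs: (time, len(labels)) log-probabilities from the acoustic model;
# random rows standing in for real network output.
log_probs = np.log(np.random.dirichlet(np.ones(len(labels)), size=200))
transcript = decoder.decode(log_probs.astype(np.float32), beam_width=100)
```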
Related papers
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
Our approach was declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and the corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- DeepFry: Identifying Vocal Fry Using Deep Neural Networks [16.489251286870704]
Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch.
Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems.
This paper proposes a deep learning model to detect creaky voice in fluent speech.
arXiv Detail & Related papers (2022-03-31T13:23:24Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Efficient Weight factorization for Multilingual Speech Recognition [67.00151881207792]
End-to-end multilingual speech recognition trains a single model on a composite speech corpus spanning many languages.
Because each language in the training data has different characteristics, the shared network may struggle to optimize for all of them simultaneously.
We propose a novel multilingual architecture that targets the core operation in neural networks: linear transformation functions.
arXiv Detail & Related papers (2021-05-07T00:12:02Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.