Deep Speech Based End-to-End Automated Speech Recognition (ASR) for
Indian-English Accents
- URL: http://arxiv.org/abs/2204.00977v1
- Date: Sun, 3 Apr 2022 03:11:21 GMT
- Title: Deep Speech Based End-to-End Automated Speech Recognition (ASR) for
Indian-English Accents
- Authors: Priyank Dubey, Bilal Shah
- Abstract summary: We have used a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Speech Recognition (ASR) is an interdisciplinary application of
computer science and linguistics that enables us to derive a transcription
from an uttered speech waveform. It has several military applications, such as
high-performance fighter aircraft, helicopters, and air-traffic control.
Beyond the military, speech recognition is used in healthcare, assistive
technology for persons with disabilities, and many other domains. ASR has been
an active research area, and several models and algorithms for speech-to-text
(STT) have been proposed. One of the most recent is Mozilla Deep Speech, which
is based on the Deep Speech research paper by Baidu. Deep Speech is a
state-of-the-art speech recognition system developed using end-to-end deep
learning; it is trained with a well-optimized Recurrent Neural Network (RNN)
training system that utilizes multiple Graphics Processing Units (GPUs). This
training is mostly done on American-English accent datasets, which results in
poor generalizability to other English accents. India is a land of vast
diversity, and this extends to speech: there are several English accents that
vary from state to state. In this work, we use a transfer learning approach
with the most recent Deep Speech model, i.e., deepspeech-0.9.3, to develop an
end-to-end speech recognition system for Indian-English accents. This work
uses fine-tuning and data augmentation to further optimize and improve the
Deep Speech ASR system. Indic TTS data of Indian-English accents is used for
transfer learning and fine-tuning the pre-trained Deep Speech model. A general
comparison is made among the untrained model, our trained model, and other
available speech recognition services for Indian-English accents.
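The paper itself does not publish code, but the workflow it describes maps onto the standard deepspeech-0.9.3 tooling: fine-tuning runs through the DeepSpeech training script (flags such as --train_files, --dev_files, and --checkpoint_dir point at the Indic TTS transcripts and the pre-trained checkpoint), and the resulting model is exported for inference. Below is a minimal inference sketch assuming only the documented deepspeech 0.9.3 Python API; the model, scorer, and audio file names are hypothetical placeholders.

```python
# Minimal sketch: transcribing Indian-English audio with a fine-tuned
# DeepSpeech 0.9.3 model exported to .pbmm (all file names are hypothetical).
import wave

import numpy as np
from deepspeech import Model

ds = Model("indian_english_0.9.3.pbmm")           # fine-tuned acoustic model
ds.enableExternalScorer("indian_english.scorer")  # optional KenLM scorer for beam-search decoding

# DeepSpeech expects 16 kHz, 16-bit, mono PCM input.
with wave.open("indic_tts_sample.wav", "rb") as wav:
    assert wav.getframerate() == 16000 and wav.getnchannels() == 1
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # decoded transcript
```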
Related papers
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
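As an illustration of the VAD step in the entry above, the sketch below frames 16 kHz mono PCM into 30 ms chunks and keeps only the frames classified as speech. The webrtcvad package is our assumed implementation choice, not one named by the paper.

```python
# Hedged sketch: frame-level voice activity detection on 16 kHz mono PCM
# using the webrtcvad package (an assumed choice, not from the cited paper).
import webrtcvad

SAMPLE_RATE = 16000                               # Hz, 16-bit mono PCM
FRAME_MS = 30                                     # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (lenient) to 3 (strict)

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames the VAD classifies as speech."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```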
- Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech [0.5330251011543498]
We used our dataset of more than 45 hours of audio data from 633 individual male and female speakers.
We recorded the highest accuracy of 85.44%.
arXiv Detail & Related papers (2024-04-18T10:17:20Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8740 hours, consisting of ~9.8K technical lectures in English along with their transcripts, delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
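Disparity studies like the one above, and the comparison in the main abstract, reduce to measuring word error rate per speaker group. Below is a minimal sketch of computing WER (and CER) with the jiwer package, which is our choice of library rather than one named by these papers; the strings are made-up examples.

```python
# Hedged sketch: word and character error rates with the jiwer package.
from jiwer import cer, wer

reference = "automated speech recognition for indian english accents"
hypothesis = "automatic speech recognition for indian accent"

print(f"WER: {wer(reference, hypothesis):.2%}")  # word-level edits / reference word count
print(f"CER: {cer(reference, hypothesis):.2%}")  # character-level equivalent
```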
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
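The concatenation-based augmentation in the entry above can be sketched in a few lines: splice waveforms from two source languages and join their transcripts, yielding synthetic code-switched training pairs. This is our reading of the abstract, not the authors' code.

```python
# Hedged sketch: concatenation-style code-switching augmentation,
# pairing spliced audio with the joined transcript.
import numpy as np

def make_codeswitch_pair(audio_a: np.ndarray, text_a: str,
                         audio_b: np.ndarray, text_b: str):
    """Return a synthetic code-switched (audio, transcript) training pair."""
    return np.concatenate([audio_a, audio_b]), f"{text_a} {text_b}"
```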
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- DeepFry: Identifying Vocal Fry Using Deep Neural Networks [16.489251286870704]
Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch.
Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems.
This paper proposes a deep learning model to detect creaky voice in fluent speech.
arXiv Detail & Related papers (2022-03-31T13:23:24Z)
- English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
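BPE-dropout, used in the entry above, regularizes training by sampling a different subword segmentation of each sentence on the fly. Below is a minimal sketch using SentencePiece's sampling API, assuming a trained BPE model file bpe.model (hypothetical); alpha is the merge-dropout probability.

```python
# Hedged sketch: BPE-dropout segmentation sampling with SentencePiece.
# "bpe.model" is a hypothetical trained BPE SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")

text = "merhaba dünya"  # made-up Turkish example, matching the paper's language
for _ in range(3):
    # enable_sampling with alpha > 0 drops merge operations at random, so each
    # call can return a different segmentation of the same sentence.
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1))
```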
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.