Code-Switched Urdu ASR for Noisy Telephonic Environment using Data
Centric Approach with Hybrid HMM and CNN-TDNN
- URL: http://arxiv.org/abs/2307.12759v1
- Date: Mon, 24 Jul 2023 13:04:21 GMT
- Title: Code-Switched Urdu ASR for Noisy Telephonic Environment using Data
Centric Approach with Hybrid HMM and CNN-TDNN
- Authors: Muhammad Danyal Khan, Raheem Ali and Arshad Aziz
- Abstract summary: Urdu is the $10^{th}$ most widely spoken language in the world, with 231,295,440 speakers worldwide, yet it still remains a resource-constrained language in ASR.
This paper describes an implementation framework for a resource-efficient Automatic Speech Recognition / Speech-to-Text system in a noisy call-center environment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Call centers hold huge amounts of audio data that can be used to
derive valuable business insights, and transcribing phone calls manually is a
tedious task. An effective Automatic Speech Recognition system can accurately
transcribe these calls, allowing easy search through call history for specific
context and content, automatic call monitoring, and improved QoS through
keyword search and sentiment analysis. ASR for call centers requires extra
robustness because telephonic environments are generally noisy. Moreover, many
low-resourced languages on the verge of extinction can be preserved with the
help of Automatic Speech Recognition technology. Urdu is the $10^{th}$ most
widely spoken language in the world, with 231,295,440 speakers worldwide, yet
it remains a resource-constrained language in ASR. Regional call-center
conversations operate in the local language, with a mix of English numbers and
technical terms, generally causing a "code-switching" problem. Hence, this
paper describes an implementation framework for a resource-efficient Automatic
Speech Recognition / Speech-to-Text system in a noisy call-center environment
using a chain hybrid HMM and CNN-TDNN for code-switched Urdu. The hybrid
HMM-DNN approach allowed us to exploit the advantages of neural networks with
less labelled data. Adding a CNN to the TDNN has been shown to work better in
noisy environments, because the CNN's additional frequency dimension captures
extra information from noisy speech and thus improves accuracy. We collected
data from various open sources and labelled some of the unlabelled data after
analysing its general context and content, drawn from Urdu as well as from
commonly used words of other languages, primarily English, and were able to
achieve a WER of 5.2% in both noisy and clean environments, on isolated words
or numbers as well as on continuous spontaneous speech.
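For orientation, the reported metric is word error rate, $WER = (S + D + I) / N$, where $S$, $D$, and $I$ count substituted, deleted, and inserted words against $N$ reference words. The sketch below illustrates the CNN-TDNN idea described above in PyTorch: a 2-D convolutional front end over time and frequency feeding TDNN-style dilated 1-D convolutions over time, with per-frame outputs over HMM states for the hybrid decoder. It is a minimal illustration under assumed layer sizes, not the authors' Kaldi chain recipe.

```python
import torch
import torch.nn as nn

class CNNTDNN(nn.Module):
    """Illustrative CNN-TDNN acoustic model: a 2-D CNN front end over
    (time, frequency) followed by TDNN layers, i.e. dilated 1-D
    convolutions over time. Sizes are assumptions, not the paper's."""

    def __init__(self, n_mels=40, n_pdf_ids=2000):
        super().__init__()
        # CNN front end: treats the spectrogram as an image, so filters
        # see a local patch of both time and frequency (the "extra
        # frequency dimension" that helps with noise).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # halve frequency only
        )
        tdnn_in = 32 * (n_mels // 2)
        # TDNN: 1-D convolutions over time with growing dilation, giving
        # a wide temporal context with few parameters.
        self.tdnn = nn.Sequential(
            nn.Conv1d(tdnn_in, 512, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
        )
        # Output layer: per-frame posteriors over HMM states (pdf-ids),
        # which the HMM/FST decoder then turns into word sequences.
        self.out = nn.Conv1d(512, n_pdf_ids, kernel_size=1)

    def forward(self, feats):            # feats: (batch, time, n_mels)
        x = feats.unsqueeze(1)           # -> (batch, 1, time, n_mels)
        x = self.cnn(x)                  # -> (batch, 32, time, n_mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f).transpose(1, 2)
        x = self.tdnn(x)                 # -> (batch, 512, time)
        return self.out(x)               # -> (batch, n_pdf_ids, time)

model = CNNTDNN()
dummy = torch.randn(2, 100, 40)          # 2 utterances, 100 frames each
print(model(dummy).shape)                # torch.Size([2, 2000, 100])
```

In the actual system, these frame-level posteriors would feed an HMM/FST decoder built over a code-switched lexicon and language model.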
Related papers
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
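The VAD step above can be sketched with the common webrtcvad package (an assumption; the paper does not name its VAD implementation). WebRTC VAD operates on 10/20/30 ms frames of 16-bit mono PCM, which matches the 16 kHz mono recording described here:

```python
import webrtcvad

def voiced_frames(pcm: bytes, sample_rate: int = 16000,
                  frame_ms: int = 30, aggressiveness: int = 2):
    """Yield the 16-bit mono PCM frames that WebRTC VAD marks as speech."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 = permissive .. 3 = strict
    # Bytes per frame: samples per frame x 2 bytes per 16-bit sample.
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            yield frame
```

Only the voiced frames would then be forwarded to the recognition engine.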
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Multilingual acoustic word embeddings for zero-resource languages [1.5229257192293204]
It specifically uses acoustic word embedding (AWE) -- fixed-dimensional representations of variable-duration speech segments.
The study introduces a new neural network that outperforms existing AWE models on zero-resource languages.
AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts.
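A minimal sketch of the AWE idea, mapping a variable-duration speech segment to one fixed-dimensional vector (a plain GRU encoder is used here for illustration; it is not the paper's new model):

```python
import torch
import torch.nn as nn

class AWEEncoder(nn.Module):
    """Map a variable-length sequence of acoustic frames to a single
    fixed-dimensional embedding (illustrative, not the paper's model)."""

    def __init__(self, n_feats: int = 40, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_feats, dim, batch_first=True)

    def forward(self, segment):      # segment: (1, n_frames, n_feats)
        _, h = self.rnn(segment)     # final hidden state summarizes it
        return h.squeeze(0)          # -> (1, dim), for any n_frames

enc = AWEEncoder()
short = torch.randn(1, 37, 40)       # 37-frame segment
long_ = torch.randn(1, 181, 40)      # 181-frame segment
print(enc(short).shape, enc(long_).shape)  # both torch.Size([1, 128])
```

Keyword spotting then reduces to nearest-neighbour search (e.g. cosine distance) between the embedding of a spoken query and embeddings of segments from the broadcast audio.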
arXiv Detail & Related papers (2024-01-19T08:02:37Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
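One common route to this kind of finetuning (an illustration, not necessarily the paper's exact setup) is to load a multilingual self-supervised checkpoint such as XLS-R from Hugging Face transformers and attach a fresh CTC head sized for the target vocabulary; the checkpoint name and vocabulary size below are assumptions:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=64,                 # assumed target-language character set
    ctc_loss_reduction="mean",
    ignore_mismatched_sizes=True,  # a fresh CTC head replaces the old one
)
model.freeze_feature_encoder()     # keep the low-level CNN encoder fixed

extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)
waveform = torch.randn(16000)      # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000,
                   return_tensors="pt")
logits = model(inputs.input_values).logits  # (1, frames, vocab_size)
```

Training then minimizes the CTC loss on the labelled target-language (and code-switched) data.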
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve the speech recognition performance of the Bengali language by adopting speech recognition technology on the E2E structure based on the transfer learning framework.
The proposed method effectively models the Bengali language and achieves a score of 3.819 in 'Levenshtein Mean Distance' on a test dataset of 7747 samples, when only 1000 samples of the training dataset were used for training.
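'Levenshtein Mean Distance' here is the edit distance between hypothesis and reference, averaged over the test set. A textbook dynamic-programming implementation:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn ref into hyp (classic dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

pairs = [("kitten", "sitting"), ("flaw", "lawn")]
print(sum(levenshtein(r, h) for r, h in pairs) / len(pairs))  # (3+2)/2 = 2.5
```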
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
- Adversarial synthesis based data-augmentation for code-switched spoken language identification [0.0]
Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR).
This study focuses on Indic language code-mixed with English.
A Generative Adversarial Network (GAN) based data-augmentation technique is applied using Mel spectrograms of the audio data.
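As a hedged sketch of the feature side of such a pipeline, the snippet below computes log-Mel spectrograms with librosa; these are the 2-D arrays a GAN-based augmenter would learn to synthesize (all parameters are illustrative assumptions):

```python
import numpy as np
import librosa

def log_mel(path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Load an audio file and return its log-Mel spectrogram, the 2-D
    input a GAN-based augmenter would train on (parameters illustrative)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel)  # shape: (n_mels, n_frames)
```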
arXiv Detail & Related papers (2022-05-30T06:41:13Z) - HUI-Audio-Corpus-German: A high quality TTS dataset [0.0]
"HUI-Audio-Corpus-German" is a large, open-source dataset for TTS engines, created with a processing pipeline.
This dataset produces high quality audio to transcription alignments and decreases manual effort needed for creation.
arXiv Detail & Related papers (2021-06-11T10:59:09Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
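BPE-dropout randomly skips merge operations during tokenisation, so the same word receives different subword segmentations across training epochs. SentencePiece exposes this through sampling-enabled encoding of a BPE model, sketched below (the model file name is an assumption):

```python
import sentencepiece as spm

# Assumes a BPE SentencePiece model trained beforehand, e.g. with:
#   spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="bpe",
#                                  vocab_size=4000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="bpe.model")

word = "telephonic"
for _ in range(3):
    # With enable_sampling=True, alpha acts as the merge-dropout
    # probability, so each call may segment the word differently.
    print(sp.encode(word, out_type=str, enable_sampling=True, alpha=0.1))
print(sp.encode(word, out_type=str))  # deterministic segmentation
```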
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, for two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z)