Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition
- URL: http://arxiv.org/abs/2311.03196v1
- Date: Mon, 6 Nov 2023 15:37:14 GMT
- Title: Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition
- Authors: Rabindra Nath Nandi, Mehadi Hasan Menon, Tareq Al Muntasir, Sagor
Sarker, Quazi Sarwar Muhtaseem, Md. Tariqul Islam, Shammur Absar Chowdhury,
Firoj Alam
- Abstract summary: In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset.
We developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios.
We benchmarked the trained ASR with publicly available datasets and compared it with other available models.
Our results demonstrate the efficacy of the model trained on pseudo-labeled data on the designed test set as well as on publicly available Bangla datasets.
- Score: 10.244515100904144
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: One of the major challenges for developing automatic speech recognition (ASR)
for low-resource languages is the limited access to labeled data with
domain-specific variations. In this study, we propose a pseudo-labeling
approach to develop a large-scale domain-agnostic ASR dataset. With the
proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset
covering diverse topics, speaking styles, dialects, noisy environments, and
conversational scenarios. We then exploited the developed corpus to design a
conformer-based ASR system. We benchmarked the trained ASR with publicly
available datasets and compared it with other available models. To investigate
the efficacy, we designed and developed a human-annotated domain-agnostic test
set composed of news, telephony, and conversational data, among others. Our
results demonstrate the efficacy of the model trained on pseudo-labeled data on
the designed test set as well as on publicly available Bangla datasets. The
experimental resources will be made publicly available
(https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR).
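The core idea of the abstract — transcribing unlabeled audio with a base model and keeping only trustworthy hypotheses as training pairs — can be illustrated with a minimal sketch. This is a hypothetical outline, not the authors' pipeline: the `transcribe` callback, `Hypothesis` type, and confidence threshold are illustrative stand-ins for whatever base models and filtering criteria the paper actually uses.

```python
# Minimal pseudo-labeling sketch for ASR data creation (illustrative only;
# the paper's actual models and filtering criteria may differ).
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float  # e.g. a mean token probability in [0, 1]


def pseudo_label(audio_clips, transcribe, min_confidence=0.9):
    """Transcribe unlabeled clips; keep only high-confidence (clip, text) pairs."""
    labeled = []
    for clip in audio_clips:
        hyp = transcribe(clip)
        if hyp.confidence >= min_confidence:
            labeled.append((clip, hyp.text))
    return labeled


# Toy transcriber standing in for a real base ASR model.
def toy_transcribe(clip):
    conf = 0.95 if "clear" in clip else 0.5
    return Hypothesis(text=f"transcript of {clip}", confidence=conf)


pairs = pseudo_label(["clear_news.wav", "noisy_phone.wav"], toy_transcribe)
# Only the high-confidence clip survives the filter.
```

In practice the confidence filter is what keeps a pseudo-labeled corpus usable at scale: low-confidence hypotheses (noisy or out-of-domain audio) are dropped rather than propagated into training.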
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling [24.870429379543193]
We tackle the challenge of limited labeled data for low-resource languages in ASR, focusing on Hindi.
Our framework integrates multiple base models for transcription and evaluators for assessing audio-transcript pairs, resulting in robust pseudo-labeling for low resource languages.
We validate our approach with a new benchmark, IndicYT, comprising diverse YouTube audio files from multiple content categories.
arXiv Detail & Related papers (2024-08-26T05:36:35Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- Towards Open-Domain Topic Classification [69.21234350688098]
We introduce an open-domain topic classification system that accepts user-defined taxonomy in real time.
Users can classify a text snippet with respect to any candidate labels they want and get an instant response from our web interface.
arXiv Detail & Related papers (2023-06-29T20:25:28Z)
- Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili [16.424308444697015]
We consider hate speech detection through keyword spotting on radio broadcasts.
One approach is to build an automatic speech recognition system for the target low-resource language.
We compare this to using acoustic word embedding models that map speech segments to a space where matching words have similar vectors.
arXiv Detail & Related papers (2023-06-01T07:25:10Z)
- Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition [0.0]
We present an approach to create labelled data for Maithili, Bhojpuri and Dogri.
All data and models are available in open domain.
arXiv Detail & Related papers (2022-03-31T06:12:52Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE).
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.