Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing
N-gram Language Models
- URL: http://arxiv.org/abs/2209.12650v1
- Date: Tue, 13 Sep 2022 17:59:21 GMT
- Title: Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing
N-gram Language Models
- Authors: Mohammed Rakib, Md. Ismail Hossain, Nabeel Mohammed, Fuad Rahman
- Abstract summary: We show how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor.
We generate a robust Bangla ASR model that is better than the existing ASR models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although over 300M people around the world speak Bangla, scant work has been done in
improving Bangla voice-to-text transcription due to Bangla being a low-resource
language. However, with the introduction of the Bengali Common Voice 9.0 speech
dataset, Automatic Speech Recognition (ASR) models can now be significantly
improved. With 399 hours of speech recordings, Bengali Common Voice is the largest
and most diversified open-source Bengali speech corpus in the world. In this
paper, we outperform the SOTA pretrained Bengali ASR models by finetuning a
pretrained wav2vec2 model on the common voice dataset. We also demonstrate how
to significantly improve the performance of an ASR model by adding an n-gram
language model as a post-processor. Finally, we do some experiments and
hyperparameter tuning to generate a robust Bangla ASR model that is better than
the existing ASR models.
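The post-processing idea described above — rescoring ASR output with an n-gram language model — can be sketched in miniature. The toy below is illustrative only, not the paper's implementation (the authors' actual tooling, likely a KenLM model plugged into a beam-search CTC decoder, is not specified here); `ctc_collapse`, `train_bigram`, and `rescore` are hypothetical names, and a character bigram stands in for a real word-level n-gram model.

```python
import math
from collections import defaultdict

def ctc_collapse(token_ids, blank=0):
    """Greedy CTC decoding: collapse repeated tokens, then drop blanks."""
    out, prev = [], None
    for t in token_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

def train_bigram(corpus):
    """Count-based character bigram LM with add-one smoothing.

    Returns a function mapping a string to its log-probability.
    """
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in corpus:
        chars = ["<s>"] + list(sent)
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1

    def logprob(text):
        chars = ["<s>"] + list(text)
        V = len(vocab) + 1  # +1 reserves mass for unseen symbols
        lp = 0.0
        for a, b in zip(chars, chars[1:]):
            total = sum(counts[a].values())
            lp += math.log((counts[a][b] + 1) / (total + V))
        return lp

    return logprob

def rescore(candidates, acoustic_scores, lm_logprob, alpha=0.5):
    """Pick the hypothesis maximizing acoustic score + alpha * LM score."""
    return max(zip(candidates, acoustic_scores),
               key=lambda ca: ca[1] + alpha * lm_logprob(ca[0]))[0]
```

The LM weight `alpha` plays the role of the interpolation hyperparameter that shallow-fusion decoders expose; in practice it is tuned on a held-out set alongside beam width.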
Related papers
- Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment [0.0]
We introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning using perfectly aligned annotations paired with synthetic acoustic degradation is effective. For speaker diarization, we observed that open-source state-of-the-art models performed surprisingly poorly on this complex dataset.
arXiv Detail & Related papers (2026-02-26T14:59:24Z) - Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization [0.0]
We describe our end-to-end system for Bengali long-form speech recognition and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
arXiv Detail & Related papers (2026-02-25T09:52:32Z) - BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects [0.0]
We present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows a client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. It can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds.
arXiv Detail & Related papers (2025-10-07T17:47:39Z) - A2TTS: TTS for Low Resource Indian Languages [16.782842482372427]
We present a speaker-conditioned text-to-speech (TTS) system aimed at generating speech for unseen speakers. Using a diffusion-based TTS architecture, a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multi-speaker generation. We employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing.
arXiv Detail & Related papers (2025-07-21T06:20:27Z) - CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training [70.31925012315064]
We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild. Key features of CosyVoice 3 include a novel speech tokenizer that improves prosody naturalness. Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects.
arXiv Detail & Related papers (2025-05-23T07:55:21Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - An Automatic Speech Recognition System for Bengali Language based on
Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve the speech recognition performance of the Bengali language by adopting speech recognition technology on the E2E structure based on the transfer learning framework.
The proposed method effectively models the Bengali language, achieving a Levenshtein Mean Distance score of 3.819 on the test dataset of 7747 samples when only 1000 samples of the train dataset were used for training.
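The Levenshtein Mean Distance reported above averages the edit distance between reference and predicted transcripts over a test set. A minimal sketch follows; the exact tokenization and normalization the entry uses are assumptions, and `mean_levenshtein` is an illustrative name.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming (one-row variant)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # distances from the empty prefix of ref
    for i in range(1, m + 1):
        prev = dp[0]  # dp value for (i-1, j-1)
        dp[0] = i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution or match
            prev = cur
    return dp[n]

def mean_levenshtein(refs, hyps):
    """Average edit distance over a test set (one plausible reading of the metric)."""
    return sum(levenshtein(r, h) for r, h in zip(refs, hyps)) / len(refs)
```

Dividing each distance by the reference length instead would yield the more common character error rate (CER); the entry's score of 3.819 suggests an unnormalized average.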
arXiv Detail & Related papers (2022-09-16T18:20:16Z) - Thai Wav2Vec2.0 with CommonVoice V8 [7.818074118880726]
Most publicly available Automatic Speech Recognition (ASR) models support English; only a minority support Thai.
Most Thai ASR models are closed-source, and the performance of existing open-source models lacks robustness.
We train a new ASR model on a pre-trained XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram language model to boost the performance of our ASR model.
arXiv Detail & Related papers (2022-08-09T14:21:48Z) - ASR-Generated Text for Language Model Pre-training Applied to Speech
Tasks [20.83731188652985]
We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows.
New models (FlauBERT-Oral) are shared with the community and evaluated for 3 downstream tasks: spoken language understanding, classification of TV shows and speech syntactic parsing.
arXiv Detail & Related papers (2022-07-05T08:47:51Z) - Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - Byakto Speech: Real-time long speech synthesis with convolutional neural
network: Transfer learning from English to Bangla [0.0]
Byakta is the first-ever open-source, deep-learning-based bilingual (Bangla and English) text-to-speech synthesis system.
A speech recognition model-based automated scoring metric was also proposed to evaluate the performance of a TTS model.
We introduce a test benchmark dataset for Bangla speech synthesis models for evaluating speech quality.
arXiv Detail & Related papers (2021-05-31T20:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.