Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
- URL: http://arxiv.org/abs/2602.21741v1
- Date: Wed, 25 Feb 2026 09:52:32 GMT
- Title: Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
- Authors: MD. Sagor Chowdhury, Adiba Fairooz Chowdhury
- Abstract summary: We describe our end-to-end system for Bengali long-form speech recognition and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the pyannote.audio pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
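The silence-boundary chunking the abstract credits as one of the three most impactful design choices can be illustrated with a minimal sketch. This is not the authors' code: the function name, frame size, energy threshold, and maximum chunk length are all illustrative assumptions. The idea is to cut a long waveform at low-energy frames so that each chunk handed to the ASR model ends at a natural pause rather than mid-word, while never exceeding a maximum duration.

```python
def silence_chunks(samples, sr, frame_ms=30, silence_thresh=0.01, max_chunk_s=30.0):
    """Return (start, end) sample indices of chunks cut at silent frames.

    Hypothetical sketch: frame-level energy thresholding stands in for a
    proper VAD; the real system may differ.
    """
    frame = int(sr * frame_ms / 1000)
    max_len = int(sr * max_chunk_s)

    # Mark each frame as silent if its mean absolute amplitude is low.
    silent = []
    for i in range(0, len(samples), frame):
        seg = samples[i:i + frame]
        energy = sum(abs(x) for x in seg) / max(len(seg), 1)
        silent.append(energy < silence_thresh)

    chunks, start, last_silence = [], 0, None
    for f, is_sil in enumerate(silent):
        pos = f * frame
        if is_sil:
            last_silence = pos  # Remember the most recent natural pause.
        if pos - start >= max_len:
            # Prefer cutting at the last silence; fall back to a hard cut.
            cut = last_silence if (last_silence is not None and last_silence > start) else pos
            chunks.append((start, cut))
            start, last_silence = cut, None
    if start < len(samples):
        chunks.append((start, len(samples)))
    return chunks
```

The payoff of cutting at silence rather than at fixed intervals is that Whisper-style models see complete utterances, which reduces boundary deletions and hallucinated continuations.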
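The centroid-based agglomerative clustering used on the speaker embeddings can likewise be sketched in a few lines. Again, this is an illustrative assumption, not the paper's implementation: pyannote.audio's clustering step is more elaborate, and the cosine-distance threshold here is invented. The sketch shows the core mechanic: start with one cluster per segment embedding, repeatedly merge the two clusters whose centroids are closest, and stop when no pair is within the threshold, with each cluster's centroid recomputed as the mean of its members.

```python
import numpy as np

def centroid_agglomerative(embeddings, threshold=0.7):
    """Cluster segment embeddings by merging nearest centroids (cosine distance).

    Hypothetical sketch of centroid-linkage agglomerative clustering; the
    threshold value is an assumption.
    """
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]      # one cluster per segment
    centroids = [X[i].copy() for i in range(len(X))]

    def cos_dist(a, b):
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    while len(clusters) > 1:
        # Exhaustively find the closest pair of cluster centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cos_dist(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # No pair is close enough; clustering is done.
        clusters[i] += clusters[j]
        centroids[i] = X[clusters[i]].mean(axis=0)  # recompute merged centroid
        del clusters[j], centroids[j]

    # Assign each segment the index of its final cluster (its speaker label).
    labels = [0] * len(X)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = k
    return labels
```

The stopping threshold effectively decides how many speakers the recording contains, which is why tuning it (or the segmentation model feeding it) matters so much for DER.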
Related papers
- WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech
This paper addresses the dual challenges of Bengali Long-Form Speech Recognition and Speaker Diarization. We implement a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX.
arXiv Detail & Related papers (2026-03-05T04:54:11Z)
- An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle. We implemented Whisper Medium fine-tuned on Bengali data for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model. Results show that targeted tuning and strategic data utilization can significantly improve AI for South Asian languages.
arXiv Detail & Related papers (2026-03-03T17:00:42Z)
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
This paper presents a robust framework specifically engineered for extended Bangla content. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
arXiv Detail & Related papers (2026-02-26T12:26:04Z)
- BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition
Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition research. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT.
arXiv Detail & Related papers (2026-01-25T03:53:14Z)
- Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition
This research presents an end-to-end framework for Bengali ASR. It builds on a Conformer-CTC backbone with a multi-level embedding fusion mechanism. The model captures fine-grained phonetic cues and higher-level contextual patterns.
arXiv Detail & Related papers (2025-12-23T04:39:12Z)
- A2TTS: TTS for Low Resource Indian Languages
We present a speaker-conditioned text-to-speech (TTS) system aimed at generating speech for unseen speakers. Using a diffusion-based TTS architecture, a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multi-speaker generation. We employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing.
arXiv Detail & Related papers (2025-07-21T06:20:27Z)
- CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild. Key features of CosyVoice 3 include a novel speech tokenizer to improve prosody naturalness. Data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects.
arXiv Detail & Related papers (2025-05-23T07:55:21Z)
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification
In noisy environments, speech can be hard to understand for humans.
We create a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing.
We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB.
arXiv Detail & Related papers (2022-10-19T02:20:17Z)
- Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models
We show how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor.
We generate a robust Bangla ASR model that is better than the existing ASR models.
arXiv Detail & Related papers (2022-09-13T17:59:21Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.