Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
- URL: http://arxiv.org/abs/2602.21741v1
- Date: Wed, 25 Feb 2026 09:52:32 GMT
- Title: Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
- Authors: MD. Sagor Chowdhury, Adiba Fairooz Chowdhury
- Abstract summary: We describe our end-to-end system for Bengali long-form speech recognition and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the pyannote.audio pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
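The silence-boundary chunking the abstract credits as one of the three most impactful design choices can be illustrated with a minimal sketch. This is not the authors' code: the function name, frame size, energy threshold, and maximum chunk length are all illustrative assumptions. The idea is to cut a long waveform at low-energy frames so that each chunk handed to the ASR model ends at a natural pause rather than mid-word, while never exceeding a maximum duration.

```python
def silence_chunks(samples, sr, frame_ms=30, silence_thresh=0.01, max_chunk_s=30.0):
    """Return (start, end) sample indices of chunks cut at silent frames.

    Hypothetical sketch: frame-level energy thresholding stands in for a
    proper VAD; the real system may differ.
    """
    frame = int(sr * frame_ms / 1000)
    max_len = int(sr * max_chunk_s)

    # Mark each frame as silent if its mean absolute amplitude is low.
    silent = []
    for i in range(0, len(samples), frame):
        seg = samples[i:i + frame]
        energy = sum(abs(x) for x in seg) / max(len(seg), 1)
        silent.append(energy < silence_thresh)

    chunks, start, last_silence = [], 0, None
    for f, is_sil in enumerate(silent):
        pos = f * frame
        if is_sil:
            last_silence = pos  # Remember the most recent natural pause.
        if pos - start >= max_len:
            # Prefer cutting at the last silence; fall back to a hard cut.
            cut = last_silence if (last_silence is not None and last_silence > start) else pos
            chunks.append((start, cut))
            start, last_silence = cut, None
    if start < len(samples):
        chunks.append((start, len(samples)))
    return chunks
```

The payoff of cutting at silence rather than at fixed intervals is that Whisper-style models see complete utterances, which reduces boundary deletions and hallucinated continuations.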
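The centroid-based agglomerative clustering used on the speaker embeddings can likewise be sketched in a few lines. Again, this is an illustrative assumption, not the paper's implementation: pyannote.audio's clustering step is more elaborate, and the cosine-distance threshold here is invented. The sketch shows the core mechanic: start with one cluster per segment embedding, repeatedly merge the two clusters whose centroids are closest, and stop when no pair is within the threshold, with each cluster's centroid recomputed as the mean of its members.

```python
import numpy as np

def centroid_agglomerative(embeddings, threshold=0.7):
    """Cluster segment embeddings by merging nearest centroids (cosine distance).

    Hypothetical sketch of centroid-linkage agglomerative clustering; the
    threshold value is an assumption.
    """
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]      # one cluster per segment
    centroids = [X[i].copy() for i in range(len(X))]

    def cos_dist(a, b):
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    while len(clusters) > 1:
        # Exhaustively find the closest pair of cluster centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cos_dist(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # No pair is close enough; clustering is done.
        clusters[i] += clusters[j]
        centroids[i] = X[clusters[i]].mean(axis=0)  # recompute merged centroid
        del clusters[j], centroids[j]

    # Assign each segment the index of its final cluster (its speaker label).
    labels = [0] * len(X)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = k
    return labels
```

The stopping threshold effectively decides how many speakers the recording contains, which is why tuning it (or the segmentation model feeding it) matters so much for DER.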
Related papers
- WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech
This paper addresses the dual challenges of Bengali Long-Form Speech Recognition and Speaker Diarization. We implement a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX.
arXiv Detail & Related papers (2026-03-05T04:54:11Z)
- An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle. We implemented Whisper Medium fine-tuned on Bengali data for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model. Results show that targeted tuning and strategic data utilization can significantly improve AI for South Asian languages.
arXiv Detail & Related papers (2026-03-03T17:00:42Z)
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
This paper presents a robust framework specifically engineered for extended Bangla content. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
arXiv Detail & Related papers (2026-02-26T12:26:04Z)
- BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition
Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition research. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT.
arXiv Detail & Related papers (2026-01-25T03:53:14Z)
- Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition
This research presents an end-to-end framework for Bengali ASR. It builds on a Conformer-CTC backbone with a multi-level embedding fusion mechanism. The model captures fine-grained phonetic cues and higher-level contextual patterns.
arXiv Detail & Related papers (2025-12-23T04:39:12Z)
- A2TTS: TTS for Low Resource Indian Languages
We present a speaker-conditioned text-to-speech (TTS) system aimed at generating speech for unseen speakers. Using a diffusion-based TTS architecture, a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multi-speaker generation. We employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing.
arXiv Detail & Related papers (2025-07-21T06:20:27Z)
- CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild. Key features of CosyVoice 3 include a novel speech tokenizer to improve prosody naturalness. Data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects.
arXiv Detail & Related papers (2025-05-23T07:55:21Z)
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification
In noisy environments, speech can be hard to understand for humans.
We create a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing.
We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB.
arXiv Detail & Related papers (2022-10-19T02:20:17Z)
- Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models
We show how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor.
We generate a robust Bangla ASR model that is better than the existing ASR models.
arXiv Detail & Related papers (2022-09-13T17:59:21Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.