A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
- URL: http://arxiv.org/abs/2602.22935v1
- Date: Thu, 26 Feb 2026 12:26:04 GMT
- Title: A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
- Authors: Zarif Ishmam, Zarif Mahir, Shafnan Wasif, Md. Ishtiak Moin,
- Abstract summary: This paper presents a robust framework specifically engineered for extended Bangla content. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) and Speaker Diarization systems for Bangla struggle when processing long-form audio exceeding 30-60 seconds. This paper presents a robust framework specifically engineered for extended Bangla content by leveraging pre-existing models enhanced with novel optimization pipelines for the DL Sprint 4.0 contest. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment to maintain temporal accuracy and transcription integrity over long durations. Additionally, we employed several fine-tuning techniques and preprocessed the data with augmentation and noise removal. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
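As an illustration of the CTC segmentation idea the abstract describes, the sketch below implements a minimal Viterbi forced alignment over a frame-by-vocabulary matrix of CTC log-probabilities, recovering a start/end frame span for each target token. This is a toy reconstruction, not the authors' pipeline: the blank index, token ids, and the tiny synthetic inputs are all assumptions.

```python
BLANK = 0  # assumed blank token id

def ctc_forced_align(log_probs, tokens):
    """Viterbi forced alignment of `tokens` against frame-level CTC log-probs.

    log_probs: (T, V) list of lists of per-frame log-probabilities.
    tokens: target token ids, blanks excluded.
    Returns a list of (token_id, start_frame, end_frame_inclusive) spans.
    """
    # Expand targets with blanks: ^ t1 ^ t2 ^ ... ^
    ext = [BLANK]
    for tok in tokens:
        ext += [tok, BLANK]
    S, T = len(ext), len(log_probs)
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]   # best path score ending at (t, s)
    bp = [[0] * S for _ in range(T)]     # backpointer to previous state s'
    dp[0][0] = log_probs[0][ext[0]]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))
            # Skipping a blank is only legal between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            dp[t][s] = best + log_probs[t][ext[s]]
            bp[t][s] = prev
    # A valid CTC path ends in the last blank or the last label.
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = bp[t][s]
    # Collapse the frame-level state path into per-token frame spans.
    spans = []
    for t, s in enumerate(path):
        if ext[s] != BLANK:
            tok_idx = (s - 1) // 2
            if spans and spans[-1][0] == tok_idx:
                spans[-1] = (tok_idx, spans[-1][1], t)
            else:
                spans.append((tok_idx, t, t))
    return [(tokens[i], a, b) for i, a, b in spans]
```

In practice one would run a library implementation (e.g. torchaudio's forced alignment) over real acoustic-model outputs rather than this hand-rolled trellis; the sketch only shows the dynamic program behind CTC-based word timing.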
Related papers
- WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech [0.0]
This paper addresses the dual challenges of Bengali Long-Form Speech Recognition and Speaker Diarization. We implement a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX.
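The chunking strategy summarized above can be sketched in a few lines: given word-level timestamps of the kind tools like whisper-timestamped produce, greedily start a new chunk whenever a silence gap appears or the chunk would exceed the decoder's window. The function name, thresholds, and the `(word, start, end)` tuple format are illustrative assumptions, not the paper's actual code.

```python
def chunk_words(words, max_len=30.0, min_gap=0.5):
    """Group (word, start, end) tuples into chunks no longer than
    max_len seconds, breaking at silences of at least min_gap seconds."""
    chunks, cur = [], []
    for word, start, end in words:
        if cur:
            span = end - cur[0][1]    # chunk length if this word is appended
            gap = start - cur[-1][2]  # silence preceding this word
            if span > max_len or gap >= min_gap:
                chunks.append(cur)
                cur = []
        cur.append((word, start, end))
    if cur:
        chunks.append(cur)
    return chunks
```

Cutting only at word boundaries (rather than at fixed offsets) avoids splitting a word across two decoder windows, which is the failure mode naive fixed-length chunking introduces in long-form transcription.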
arXiv Detail & Related papers (2026-03-05T04:54:11Z)
- Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment [0.0]
We introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation proves effective. For speaker diarization, we observed that global open-source state-of-the-art models performed surprisingly poorly on this complex dataset.
arXiv Detail & Related papers (2026-02-26T14:59:24Z)
- Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization [0.0]
We describe our end-to-end system for Bengali long-form speech recognition and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
arXiv Detail & Related papers (2026-02-25T09:52:32Z)
- VIBEVOICE-ASR Technical Report [95.57263110940973]
VibeVoice-ASR addresses challenges of context fragmentation and multi-speaker complexity in long-form audio. It supports over 50 languages, requires no explicit language setting, and handles code-switching within and across utterances.
arXiv Detail & Related papers (2026-01-26T06:11:51Z)
- BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition [0.0]
Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition research. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT.
arXiv Detail & Related papers (2026-01-25T03:53:14Z)
- BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects [0.0]
We present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. It can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds.
arXiv Detail & Related papers (2025-10-07T17:47:39Z)
- VibeVoice Technical Report [90.14596405668135]
VibeVoice is a model designed to synthesize long-form speech with multiple speakers. We introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times.
arXiv Detail & Related papers (2025-08-26T17:09:12Z)
- Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice [52.747242157396315]
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry. We introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities.
arXiv Detail & Related papers (2025-07-23T14:07:41Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
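The BPE-dropout idea in the last entry can be illustrated with a toy encoder: standard BPE repeatedly applies the highest-priority merge, while BPE-dropout skips each candidate merge with probability p, so the same word yields varied subword segmentations during training. The merge table and function below are hypothetical, not taken from the paper.

```python
import random

def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Encode `word` with BPE merges, skipping each candidate merge
    with probability p (BPE-dropout) to vary the segmentation."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    toks = list(word)
    while True:
        # Collect applicable merges, dropping each with probability p.
        cands = [(ranks[(a, b)], i)
                 for i, (a, b) in enumerate(zip(toks, toks[1:]))
                 if (a, b) in ranks and rng.random() >= p]
        if not cands:
            break
        _, i = min(cands)  # apply the highest-priority surviving merge
        toks[i:i + 2] = [toks[i] + toks[i + 1]]
    return toks
```

With p=0 this reduces to ordinary deterministic BPE; with p=1 every merge is dropped and the word falls back to characters. Intermediate values expose the acoustic model to many segmentations of each word, which is what makes the technique useful in low-resource, high-OOV setups.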
This list is automatically generated from the titles and abstracts of the papers in this site.