Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
- URL: http://arxiv.org/abs/2602.23070v1
- Date: Thu, 26 Feb 2026 14:59:24 GMT
- Title: Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
- Authors: Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan
- Abstract summary: We introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning using perfectly aligned annotations paired with synthetic acoustic degradation proves most effective. For speaker diarization, we observed that global open-source state-of-the-art models performed surprisingly poorly on this complex dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, which details our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the single most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver for increasing accuracy. Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
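The "synthetic acoustic degradation (noise and reverberation)" that the abstract credits could, in its simplest form, combine additive noise at a controlled SNR with convolution against a synthetic room impulse response. The sketch below is an illustrative reconstruction of that general idea, not the authors' code; the function name, SNR, and decay constants are all assumptions:

```python
import numpy as np

def degrade(wave, sr=16000, snr_db=10.0, rir_len_s=0.3, rng=None):
    """Corrupt a clean waveform with additive noise at a target SNR and
    synthetic reverberation from an exponentially decaying random impulse
    response (a stand-in for a measured room impulse response)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Additive white noise, scaled to hit the requested signal-to-noise ratio.
    noise = rng.standard_normal(len(wave))
    sig_pow = np.mean(wave ** 2)
    scale = np.sqrt(sig_pow / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    noisy = wave + scale * noise
    # Synthetic reverberation: white noise under a ~60 dB exponential decay.
    n = int(rir_len_s * sr)
    t = np.arange(n) / sr
    rir = rng.standard_normal(n) * np.exp(-6.9 * t / rir_len_s)
    rir /= np.abs(rir).max() + 1e-8
    reverbed = np.convolve(noisy, rir)[: len(wave)]
    return reverbed / (np.abs(reverbed).max() + 1e-8)  # peak-normalize
```

Training on such degraded copies (alongside the clean originals) is the "hard to hear" part of the title: the model sees acoustically harder inputs than it will at test time.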
Related papers
- WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech [0.0]
This paper addresses the dual challenges of Bengali Long-Form Speech Recognition and Speaker Diarization. We implement a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX.
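The heuristic post-processing of diarizer output that the main abstract credits as its primary accuracy driver typically means merging adjacent same-speaker segments and dropping very short turns, regardless of which diarizer (pyannote.audio, Diarizen, etc.) produced them. A minimal sketch; the thresholds `min_dur` and `max_gap` are illustrative, not values from any of these papers:

```python
def postprocess(segments, min_dur=0.5, max_gap=0.3):
    """Heuristic cleanup of diarizer output: merge adjacent segments from
    the same speaker separated by a short gap, then drop segments shorter
    than min_dur seconds. Each segment is a (start, end, speaker) tuple."""
    merged = []
    for start, end, spk in sorted(segments):
        if merged and merged[-1][2] == spk and start - merged[-1][1] <= max_gap:
            # Same speaker, small gap: extend the previous segment.
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]), spk)
        else:
            merged.append((start, end, spk))
    return [(s, e, spk) for s, e, spk in merged if e - s >= min_dur]
```

For example, `postprocess([(0.0, 1.0, "A"), (1.1, 2.0, "A"), (2.2, 2.4, "B"), (2.5, 4.0, "A")])` fuses the two "A" turns and discards the 0.2 s "B" blip.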
arXiv Detail & Related papers (2026-03-05T04:54:11Z)
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment [0.0]
This paper presents a robust framework specifically engineered for extended Bangla content. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
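For intuition about what VAD-based segmentation does, a toy energy-threshold detector can be sketched in a few lines. Pipelines like the one above use trained neural VADs; the framing parameters and `thresh_db` here are purely illustrative:

```python
import numpy as np

def energy_vad(wave, sr=16000, frame_ms=25, hop_ms=10, thresh_db=-35.0):
    """Toy energy-based VAD: mark a frame as speech when its log energy,
    relative to the loudest frame, exceeds thresh_db. Returns one boolean
    per hop; runs of True frames become candidate speech segments."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(wave) - frame) // hop)
    energy = np.array([
        np.mean(wave[i * hop : i * hop + frame] ** 2) for i in range(n_frames)
    ])
    log_e = 10 * np.log10(energy + 1e-12)
    return log_e > (log_e.max() + thresh_db)
```

Long-form audio is then cut at the silent gaps this flags, so each chunk fed to the recognizer is a self-contained stretch of speech.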
arXiv Detail & Related papers (2026-02-26T12:26:04Z)
- Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition [2.235406148098187]
This research presents an end-to-end framework for Bengali ASR. It builds on a Conformer-CTC backbone with a multi-level embedding fusion mechanism. The model captures fine-grained phonetic cues and higher-level contextual patterns.
arXiv Detail & Related papers (2025-12-23T04:39:12Z)
- Fun-ASR Technical Report [89.84148151617022]
We present Fun-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, and hotword customization, among other real-world application requirements. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
arXiv Detail & Related papers (2025-09-15T23:19:36Z)
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [63.741916531380696]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- Cocktail-Party Audio-Visual Speech Recognition [58.222892601847924]
This study introduces a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems. We contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER from 119% to 39.2% in extreme noise, a 67% relative reduction over the state of the art, without relying on explicit segmentation cues.
arXiv Detail & Related papers (2025-06-02T19:07:51Z)
- Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation [18.678742816040856]
Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set. We examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models.
arXiv Detail & Related papers (2025-05-27T00:55:32Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
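WER, the metric quoted throughout this list, is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference length; a compact reference implementation, using a rolling single-row dynamic program:

```python
def wer(ref, hyp):
    """Word error rate between a reference and hypothesis transcript:
    edit distance over word sequences, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # deletion, insertion, or (mis)match against the diagonal cell
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)
```

Note that WER can exceed 100% (as in the 119% figure above) when the hypothesis contains many insertions.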
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Extending Whisper with prompt tuning to target-speaker ASR [18.31992429200396]
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from overlapped utterances.
Most of the existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model.
This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR.
arXiv Detail & Related papers (2023-12-13T11:49:16Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
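BPE-dropout, the technique behind the augmentation above, stochastically skips learned merge operations so the same word yields different subword segmentations across training epochs. A toy sketch of the idea on a hand-built merge table (not the authors' implementation; the merges and `p` are illustrative):

```python
import random

def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Toy BPE-dropout: apply learned merges greedily by rank, but skip
    each applicable merge with probability p, yielding stochastic subword
    segmentations. `merges` is an ordered list of symbol pairs,
    highest-priority first."""
    if rng is None:
        rng = random.Random(0)
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # apply the best-ranked surviving merge
        symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
```

With `p=0` this reduces to ordinary greedy BPE; with `p=1` every merge is dropped and the word falls back to characters, so intermediate values expose the model to many segmentations of each word.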
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, reaches quality only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.