WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech
- URL: http://arxiv.org/abs/2603.04809v1
- Date: Thu, 05 Mar 2026 04:54:11 GMT
- Title: WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech
- Authors: Aurchi Chowdhury, Rubaiyat-E-Zaman, Sk. Ashrafuzzaman Nafees
- Abstract summary: This paper addresses the dual challenges of Bengali Long-Form Speech Recognition and Speaker Diarization. We implement a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we develop an integrated pipeline leveraging pyannote.audio and WhisperX.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our solution for the DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio introduces significant hurdles in voice activity detection, overlapping speech, and context preservation. To solve the long-form transcription challenge, we implemented a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture the nuances of Bengali conversational dynamics and accurately resolve complex, overlapping speaker boundaries. Our methodology demonstrates that applying intelligent timestamped chunking to ASR and targeted segmentation fine-tuning to diarization significantly drives down Word Error Rate (WER) and Diarization Error Rate (DER) in low-resource settings.
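To make the chunking strategy concrete, below is a minimal sketch using the whisper-timestamped package: a first pass produces word-level timestamps, and words are then grouped into segments that always end on a word boundary. The 25-second chunk budget, the stock "small" checkpoint (standing in for the fine-tuned Bengali model), and the grouping heuristic are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: word-boundary-aware chunking with whisper-timestamped.
# The checkpoint name and the 25 s chunk budget are illustrative guesses,
# not the configuration reported in the paper.
import whisper_timestamped as whisper

audio = whisper.load_audio("long_form_bengali.wav")
model = whisper.load_model("small")  # stand-in for the fine-tuned Bengali model

# First pass: obtain word-level timestamps for the whole recording.
result = whisper.transcribe(model, audio, language="bn")

# Group words into chunks that end on natural word boundaries,
# so no chunk cuts through the middle of a word.
MAX_CHUNK_SEC = 25.0
chunks, current, chunk_start = [], [], 0.0
for segment in result["segments"]:
    for word in segment.get("words", []):
        if word["end"] - chunk_start > MAX_CHUNK_SEC and current:
            chunks.append((chunk_start, current[-1]["end"], current))
            current, chunk_start = [], word["start"]
        current.append(word)
if current:
    chunks.append((chunk_start, current[-1]["end"], current))

for start, end, words in chunks:
    print(f"{start:7.2f}-{end:7.2f}s  {len(words)} words")
```

In a pipeline like the one described, these boundary-aligned spans would be cut from the waveform and re-transcribed by the fine-tuned acoustic model, so that no segment starts or ends mid-word.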
Related papers
- An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization [0.0]
This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle. We implemented Whisper Medium fine-tuned on Bengali data for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model; a sketch of this pattern follows this entry. Results show that targeted tuning and strategic data utilization can significantly improve AI for South Asian languages.
arXiv Detail & Related papers (2026-03-03T17:00:42Z)
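The pyannote pattern mentioned above (a pretrained diarization recipe combined with a segmentation model fine-tuned on competition data) might look roughly like the sketch below. The entry names pyannote/speaker-diarization-community-1; here we assemble pyannote's generic SpeakerDiarization pipeline around a hypothetical fine-tuned checkpoint, which is one way to realize the same idea. The checkpoint path, embedding model, and hyperparameters are assumptions for illustration; constructor arguments vary across pyannote.audio versions, so consult the documentation for your installed release.

```python
# Minimal sketch: pyannote.audio diarization with a custom segmentation model.
# "finetuned_segmentation.ckpt" and the embedding checkpoint are hypothetical
# placeholders, not artifacts released with either paper.
from pyannote.audio import Model
from pyannote.audio.pipelines import SpeakerDiarization

# Load the segmentation model fine-tuned on domain data.
segmentation = Model.from_pretrained("finetuned_segmentation.ckpt")

# Assemble a diarization pipeline around it (embedding model is illustrative).
pipeline = SpeakerDiarization(
    segmentation=segmentation,
    embedding="speechbrain/spkrec-ecapa-voxceleb",
)
pipeline.instantiate({
    "segmentation": {"min_duration_off": 0.0},
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7},
})

diarization = pipeline("long_form_bengali.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.2f}-{turn.end:7.2f}s  {speaker}")
```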
- Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment [0.0]
We introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. For ASR, we demonstrate that raw data scaling is ineffective; gains come instead from targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation. For speaker diarization, we observed that global open-source state-of-the-art models performed surprisingly poorly on this complex dataset.
arXiv Detail & Related papers (2026-02-26T14:59:24Z)
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment [0.0]
This paper presents a robust framework specifically engineered for extended Bangla content. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation; a sketch of CTC-based alignment follows this entry. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
arXiv Detail & Related papers (2026-02-26T12:26:04Z)
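CTC segmentation aligns a known transcript to frame-level CTC posteriors, so reliable cut points can be read off the alignment. The sketch below uses torchaudio's forced-alignment utility (available in torchaudio 2.1+); an English wav2vec2 bundle stands in for a Bengali CTC model, and the file name and transcript are placeholders, since the idea (align transcript tokens to frames, cut at word boundaries) is the same either way.

```python
# Minimal sketch: CTC forced alignment for segmentation with torchaudio.
# An English wav2vec2 bundle stands in for a Bengali acoustic model;
# "speech.wav" is assumed to be a mono recording of the transcript below.
import torch
import torchaudio
from torchaudio.functional import forced_align

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # CTC vocabulary; index 0 is the blank "-"
dictionary = {c: i for i, c in enumerate(labels)}

waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

transcript = "HELLO WORLD"
targets = torch.tensor(
    [[dictionary[c] for c in transcript.replace(" ", "|")]], dtype=torch.int32
)

with torch.inference_mode():
    emissions, _ = model(waveform)                # (1, frames, vocab)
    log_probs = torch.log_softmax(emissions, dim=-1)

# Frame-level alignment of each transcript token; blank index is 0.
alignment, scores = forced_align(log_probs, targets, blank=0)

frame_sec = waveform.shape[1] / bundle.sample_rate / log_probs.shape[1]
sep, prev = dictionary["|"], None
# Word boundaries are frames where the "|" separator token is first emitted.
for frame, token in enumerate(alignment[0].tolist()):
    if token == sep and prev != sep:
        print(f"word boundary near {frame * frame_sec:.2f}s")
    prev = token
```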
- Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization [0.0]
We describe our end-to-end system for Bengali long-form speech recognition and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking (sketched after this entry) are the three most impactful design choices for low-resource Bengali speech processing.
arXiv Detail & Related papers (2026-02-25T09:52:32Z)
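Silence-aware chunking splits a long recording at natural pauses instead of at fixed intervals, so no cut lands mid-word. A minimal version with pydub is sketched below; the silence threshold (-40 dBFS), the 500 ms minimum pause, and the 25 s chunk cap are illustrative assumptions that would need tuning per corpus.

```python
# Minimal sketch: silence-aware chunking with pydub.
# Threshold and minimum pause length are illustrative guesses, not
# values reported by any of the papers above.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("long_form_bengali.wav")

# Find speech regions separated by at least 500 ms of silence.
speech_spans = detect_nonsilent(audio, min_silence_len=500, silence_thresh=-40)

# Merge adjacent spans into chunks no longer than ~25 s each.
MAX_MS = 25_000
chunks, start, end = [], None, None
for s, e in speech_spans:
    if start is None:
        start, end = s, e
    elif e - start <= MAX_MS:
        end = e
    else:
        chunks.append((start, end))
        start, end = s, e
if start is not None:
    chunks.append((start, end))

# Export each chunk as its own file for downstream transcription.
for i, (s, e) in enumerate(chunks):
    audio[s:e].export(f"chunk_{i:03d}.wav", format="wav")
```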
- VIBEVOICE-ASR Technical Report [95.57263110940973]
VibeVoice-ASR addresses challenges of context fragmentation and multi-speaker complexity in long-form audio. It supports over 50 languages, requires no explicit language setting, and handles code-switching within and across utterances.
arXiv Detail & Related papers (2026-01-26T06:11:51Z)
- Continual Speech Learning with Fused Speech Features [49.21227244653524]
We introduce continual speech learning, a new setup aimed at bridging the adaptation gap in current speech models. We use the encoder-decoder Whisper model to standardize speech tasks into a generative format. Our approach improves accuracy significantly over traditional methods in six speech processing tasks, demonstrating gains in adapting to new speech tasks without full retraining.
arXiv Detail & Related papers (2025-06-02T09:59:35Z)
- WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models [49.725968706743586]
WavRAG is the first retrieval augmented generation framework with native, end-to-end audio support. We propose the WavRetriever to facilitate retrieval from a text-audio hybrid knowledge base. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration.
arXiv Detail & Related papers (2025-02-20T16:54:07Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)