An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
- URL: http://arxiv.org/abs/2603.03158v1
- Date: Tue, 03 Mar 2026 17:00:42 GMT
- Title: An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
- Authors: Epshita Jahan, Khandoker Md Tanjinul Islam, Pritom Biswas, Tafsir Al Nafin,
- Abstract summary: This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle. We implemented Whisper Medium fine-tuned on Bengali data for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model. Results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle, addressing the challenge of "who spoke when/what" in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi bengaliai-asr whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic post-processing yielded a WER of 0.38 on the private leaderboard. These results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. All relevant code is available at: https://github.com/Short-Potatoes/Bengali-long-form-transcription-and-diarization.git
- Index Terms: Bengali speech recognition, speaker diarization, Whisper, ASR, low-resource languages, pyannote, voice activity detection
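The diarization error rate (DER) reported above can be illustrated with a simplified frame-level computation. This is a minimal sketch: real evaluations (e.g. pyannote.metrics) add an optimal speaker mapping and a forgiveness collar, which are omitted here, and the toy segments are hypothetical.

```python
# Simplified frame-level DER: (missed + false alarm + confusion) / total
# reference speech. Assumes hypothesis speaker labels are already aligned
# with reference labels, which real scorers do not assume.

def der(reference, hypothesis, frame=0.01):
    """reference/hypothesis: lists of (start_sec, end_sec, speaker)."""
    end = max(e for _, e, _ in reference + hypothesis)
    n = int(round(end / frame))
    ref, hyp = [None] * n, [None] * n
    for track, segments in ((ref, reference), (hyp, hypothesis)):
        for s, e, spk in segments:
            for i in range(int(round(s / frame)), int(round(e / frame))):
                track[i] = spk
    missed = sum(1 for r, h in zip(ref, hyp) if r and not h)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if h and not r)
    confusion = sum(1 for r, h in zip(ref, hyp) if r and h and r != h)
    total_ref = sum(1 for r in ref if r)
    return (missed + false_alarm + confusion) / total_ref

ref = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hyp = [(0.0, 6.0, "A"), (6.0, 10.0, "B")]
print(round(der(ref, hyp), 2))  # 1 s of confusion over 10 s of speech -> 0.1
```

A DER of 0.27, as on the private leaderboard, thus means roughly 27% of reference speech time was missed, falsely detected, or attributed to the wrong speaker.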
Related papers
- WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech [0.0]
This paper addresses the dual challenges of Bengali Long-Form Speech Recognition and Speaker Diarization. We implement a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX.
arXiv Detail & Related papers (2026-03-05T04:54:11Z)
- Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization [0.0]
We describe our end-to-end system for Bengali long-form speech recognition and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
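The "natural silence-aware chunking" named above can be sketched as splitting a waveform at low-energy regions so chunk boundaries fall in pauses rather than mid-word. The window size, chunk limit, and search heuristic below are illustrative assumptions, not the paper's actual parameters.

```python
# Sketch of silence-aware chunking: when a chunk would exceed the length
# limit, cut at the quietest short window in the last quarter of the
# candidate chunk instead of at the hard limit.

def silence_aware_chunks(samples, rate, max_chunk_s=30.0, win_s=0.02):
    """Return (start, end) sample indices of chunks, each at most
    max_chunk_s long, preferring cuts in low-energy (silent) windows."""
    win = int(win_s * rate)
    max_len = int(max_chunk_s * rate)
    chunks, start = [], 0
    while len(samples) - start > max_len:
        lo, hi = start + 3 * max_len // 4, start + max_len
        best, best_energy = hi, float("inf")
        for i in range(lo, hi - win, win):
            energy = sum(x * x for x in samples[i:i + win]) / win
            if energy < best_energy:
                best, best_energy = i, energy
        chunks.append((start, best))
        start = best
    chunks.append((start, len(samples)))
    return chunks
```

Cutting in pauses keeps each chunk a self-contained utterance, which avoids the word-splitting errors that fixed-length chunking introduces at segment boundaries.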
arXiv Detail & Related papers (2026-02-25T09:52:32Z)
- A2TTS: TTS for Low Resource Indian Languages [16.782842482372427]
We present a speaker conditioned text-to-speech (TTS) system aimed at generating speech for unseen speakers. Using a diffusion-based TTS architecture, a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. We employ a cross-attention based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker consistent timing.
arXiv Detail & Related papers (2025-07-21T06:20:27Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve the speech recognition performance of the Bengali language by adopting speech recognition technology on the E2E structure based on the transfer learning framework.
The proposed method effectively models the Bengali language and achieves a score of 3.819 in Levenshtein Mean Distance on the test dataset of 7747 samples, when only 1000 samples of the training dataset were used for training.
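The Levenshtein (edit) distance underlying this metric is the minimum number of insertions, deletions, and substitutions needed to turn one string into another, computed with the standard dynamic-programming recurrence. Averaging it over a test set, shown below as `mean_levenshtein`, is our presumed reading of "Levenshtein Mean Distance", not the competition's documented definition.

```python
# Standard two-row DP for Levenshtein distance between two strings.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def mean_levenshtein(pairs):
    """Average edit distance over (reference, hypothesis) pairs."""
    return sum(levenshtein(r, h) for r, h in pairs) / len(pairs)

print(levenshtein("kitten", "sitting"))  # 3
```

The same distance over word sequences instead of characters, normalized by reference length, gives the WER used elsewhere in this listing.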
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered due to the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio [88.20960848885575]
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h.
arXiv Detail & Related papers (2021-06-13T04:09:16Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.