BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition
- URL: http://arxiv.org/abs/2601.17679v1
- Date: Sun, 25 Jan 2026 03:53:14 GMT
- Title: BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition
- Authors: Md Sazzadul Islam Ridoy, Mubaswira Ibnat Zidney, Sumi Akter, Md. Aminur Rahman
- Abstract summary: Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition research. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition (ASR) research, particularly under noisy and speaker-diverse conditions. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT, designed to address these challenges. The architecture integrates a diffusion-based denoising module to suppress environmental noise while preserving Bangla-specific phonetic cues, and a contextual cross-attention module that conditions recognition on speaker embeddings for robustness across gender, age, and dialects. Trained end-to-end with a composite objective combining CTC loss, phonetic consistency, and speaker alignment, BanglaRobustNet achieves substantial reductions in word error rate (WER) and character error rate (CER) compared to Wav2Vec-BERT and Whisper baselines. Evaluations on Mozilla Common Voice Bangla and augmented noisy speech confirm the effectiveness of our approach, establishing BanglaRobustNet as a robust ASR system tailored to low-resource, noise-prone linguistic settings.
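The composite training objective described in the abstract combines CTC loss with phonetic-consistency and speaker-alignment terms. A minimal sketch of such a weighted sum follows; the weight values and function name are illustrative assumptions, not the paper's actual hyper-parameters:

```python
# Sketch of a composite objective: L = L_CTC + l_phon * L_phon + l_spk * L_spk.
# The individual loss values are placeholders; in the real system they would
# come from a CTC head, a phonetic-consistency head, and a speaker-alignment
# head. The default weights below are assumed, not taken from the paper.

def composite_loss(ctc_loss: float,
                   phonetic_loss: float,
                   speaker_loss: float,
                   lambda_phon: float = 0.1,
                   lambda_spk: float = 0.1) -> float:
    """Weighted sum of the three training objectives."""
    return ctc_loss + lambda_phon * phonetic_loss + lambda_spk * speaker_loss

# Example with dummy per-batch loss values:
total = composite_loss(ctc_loss=2.5, phonetic_loss=0.8, speaker_loss=0.4)
```

In practice the weights would be tuned so that the auxiliary terms regularize the acoustic model without dominating the CTC objective.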
Related papers
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment [0.0]
This paper presents a robust framework specifically engineered for extended Bangla content. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
arXiv Detail & Related papers (2026-02-26T12:26:04Z) - SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition [0.8921166277011348]
Single-word Automatic Speech Recognition is a challenging task due to the lack of linguistic context. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. We evaluate the framework on the Google Speech Commands dataset and a real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions.
arXiv Detail & Related papers (2026-01-28T04:50:04Z) - WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z) - Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition [2.235406148098187]
This research presents an end-to-end framework for Bengali ASR. It builds on a Conformer-CTC backbone with a multi-level embedding fusion mechanism. The model captures fine-grained phonetic cues and higher-level contextual patterns.
arXiv Detail & Related papers (2025-12-23T04:39:12Z) - Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR [37.09163295946173]
We propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise-invariance.
arXiv Detail & Related papers (2025-10-29T04:08:19Z) - From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach [0.0]
We propose an automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.
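The figures reported above are mutually consistent: a 6-percentage-point absolute drop from a 36% baseline yields a 30% WER and roughly a 16.7% relative reduction, in line with the stated 16%. The arithmetic can be checked directly:

```python
# Relative vs. absolute WER reduction implied by the reported numbers.
baseline_wer = 0.36   # 36% baseline WER
absolute_drop = 0.06  # 6 percentage-point absolute reduction

new_wer = baseline_wer - absolute_drop             # resulting WER
relative_reduction = absolute_drop / baseline_wer  # fraction of the baseline removed
```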
arXiv Detail & Related papers (2025-09-02T16:13:29Z) - TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that TRNet substantially improves robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z) - UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z) - Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
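The target-switching idea above can be pictured with a small sketch: each view of an original/noisy pair is asked to predict the *other* view's quantized representations, which encourages noise-invariant context vectors. The token lists below are toy stand-ins for quantized codebook IDs, not real wav2vec 2.0 outputs:

```python
# Illustrative sketch of wav2vec-Switch-style target swapping. Given the
# quantized token sequences of an original utterance and its noisy version,
# the prediction targets are exchanged between the two views.

def switched_targets(orig_tokens, noisy_tokens):
    """Return (targets for the original view, targets for the noisy view)."""
    # The original view predicts the noisy quantization and vice versa.
    return noisy_tokens, orig_tokens

orig = [12, 7, 7, 31]   # toy codebook IDs for the clean utterance
noisy = [12, 9, 7, 31]  # toy codebook IDs for its noise-augmented copy
t_orig, t_noisy = switched_targets(orig, noisy)
```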
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
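The dynamic combination of noisy and enhanced features can be pictured as an element-wise gate blending the two streams. The sigmoid-gated convex combination below is a common formulation and an assumption here, not necessarily the paper's exact design:

```python
import math

# Gated fusion sketch: h = g * x_noisy + (1 - g) * x_enhanced, where the
# gate g is produced per dimension from a score passed through a sigmoid.
# The gating scores are given directly here; in a real model they would
# come from a learned gating network (assumed, not from the paper).

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(x_noisy, x_enh, gate_scores):
    """Element-wise convex combination of the two feature streams."""
    gates = [sigmoid(s) for s in gate_scores]
    return [g * n + (1.0 - g) * e
            for g, n, e in zip(gates, x_noisy, x_enh)]

# Score 0.0 gives a 0.5 gate (midpoint); a large score favors the noisy stream.
fused = gated_fusion([0.5, -1.0], [0.2, 0.4], [0.0, 10.0])
```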
arXiv Detail & Related papers (2020-11-09T08:52:05Z) - Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.