Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR
- URL: http://arxiv.org/abs/2510.25150v1
- Date: Wed, 29 Oct 2025 04:08:19 GMT
- Title: Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR
- Authors: Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng
- Abstract summary: We propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise-invariance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue, which is supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise-invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and a 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
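The core idea in the abstract — quantizing frame embeddings into codebook tokens for clean speech while keeping the quantization residue as an interpretable noise vector — can be illustrated with a minimal PyTorch sketch. This is not the paper's actual architecture: the class name, codebook size, embedding dimension, and the simple nearest-neighbor lookup are all illustrative assumptions, and the paper's residue classifier and Whisper front end are omitted.

```python
import torch
import torch.nn as nn

class DisentanglingQuantizer(nn.Module):
    """Sketch: map each frame embedding to its nearest codebook entry
    (a discrete 'clean speech' token) and expose the quantization
    residue as a separate noise vector. Names and sizes are illustrative,
    not the paper's."""
    def __init__(self, num_codes: int = 512, dim: int = 384):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) frame embeddings, e.g. from a frozen encoder.
        # Nearest-neighbor lookup over the codebook by L2 distance.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        tokens = dists.argmin(dim=-1)          # (batch, frames) discrete tokens
        quantized = self.codebook(tokens)      # clean-speech estimate
        noise = z - quantized                  # residue kept as a noise vector
        return tokens, quantized, noise

quantizer = DisentanglingQuantizer(num_codes=512, dim=384)
z = torch.randn(2, 100, 384)                   # stand-in for Whisper embeddings
tokens, quantized, noise = quantizer(z)
print(tokens.shape, noise.shape)
```

In the paper's setup, the noise vector would additionally feed a lightweight classifier so the residue stays interpretable; here the decomposition z = quantized + noise is the only constraint shown.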
Related papers
- Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise [9.536089523962486]
  We propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. We show that ANPrompt consistently outperforms existing prompt tuning methods. It offers superior robustness to semantic noise and improved generalization across tasks.
  arXiv Detail & Related papers (2025-08-06T17:42:30Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
  This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is validated with experiments conducted on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF.
  arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
  Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
  arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
  We use a state variable to indicate the denoising process. A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
  arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
  A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data. The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
  arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
  A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker, and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
  arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
  Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve performance comparable to the best reported supervised approach with only 16% of labeled data.
  arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder [30.318947721658862]
  We propose to include noise information in the training phase by using a noise-aware encoder trained on noisy-clean speech pairs. We show that our proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters.
  arXiv Detail & Related papers (2021-02-17T11:40:42Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
  We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features. At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features. At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
  arXiv Detail & Related papers (2020-04-06T06:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.