Related papers: Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

URL: http://arxiv.org/abs/2509.16622v2
Date: Thu, 09 Oct 2025 07:55:28 GMT
Title: Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Authors: Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
Abstract summary: We present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower.
Score: 33.36615989947073
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
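
The relative improvement quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the baseline WER it prints is back-computed from the two rounded figures in the abstract (4.94% WER and 12.3% relative improvement on test-other), so it is an approximation rather than a number reported by the authors.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system over a baseline, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


def implied_baseline_wer(system_wer: float, relative_reduction: float) -> float:
    """Invert the relation above to estimate the baseline WER."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative_reduction must be in [0, 1)")
    return system_wer / (1.0 - relative_reduction)


# Figures quoted in the abstract for LibriSpeech test-other.
system_wer = 4.94           # best cascade system, WER in percent
reported_reduction = 0.123  # 12.3% relative improvement

# Assumption: the baseline is not stated in the abstract; this is an estimate
# derived from rounded inputs.
baseline = implied_baseline_wer(system_wer, reported_reduction)
print(f"Implied baseline WER on test-other: about {baseline:.2f}%")
```

Because both inputs are rounded, the estimate should be read as a rough consistency check, not as the baseline figure from the paper.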

Related papers

Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods [0.15749416770494704]
Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition.
arXiv Detail & Related papers (2026-02-05T18:46:28Z)
SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models [49.313324100819955]
Signal Embedding Energy (SEE) is a method for quantifying the impact of noise intensity on LALM inputs. SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
arXiv Detail & Related papers (2026-01-12T08:57:55Z)
Reproducing and Dissecting Denoising Language Models for Speech Recognition [31.91567892562116]
Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for automatic speech recognition (ASR). This paper presents the first independent, large-scale empirical study of DLMs.
arXiv Detail & Related papers (2025-12-15T17:33:22Z)
MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization [66.82303841930752]
Diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. We propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process.
arXiv Detail & Related papers (2025-10-24T13:57:59Z)
Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for the generative error correction (GER) framework in audio-visual speech recognition (AVSR). Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. Our framework attains up to 57.7% error rate gain on the LRS2 benchmark over a standard ASR baseline, in contrast to single-stream GER approaches that achieve only 10% gain.
arXiv Detail & Related papers (2025-10-15T08:27:16Z)
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including Audio-Visual Speech Recognition (AVSR). To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules.
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition [46.438575751932866]
LipGER is a framework for leveraging visual cues for noise-robust ASR. We show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
arXiv Detail & Related papers (2024-06-06T18:17:59Z)
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification. DASA generates diversified training samples in speaker embedding space with negligible extra computing cost. The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
Cross-Utterance Language Models with Acoustic Error Sampling [1.376408511310322]
Cross-utterance LM (CULM) is proposed to augment the input to a standard long short-term memory (LSTM) LM. An acoustic error sampling technique is proposed to reduce the mismatch between training and test-time. Experiments performed on both AMI and Switchboard datasets show that CULMs outperform the LSTM LM baseline WER.
arXiv Detail & Related papers (2020-08-19T17:40:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.