Drax: Speech Recognition with Discrete Flow Matching
- URL: http://arxiv.org/abs/2510.04162v1
- Date: Sun, 05 Oct 2025 11:38:01 GMT
- Title: Drax: Speech Recognition with Discrete Flow Matching
- Authors: Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya
- Abstract summary: Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling. We propose Drax, a discrete flow matching framework for automatic speech recognition (ASR). We construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors.
- Score: 26.991421132974097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling, however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random noise to target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
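The core training idea described in the abstract — corrupting the target transcript along a probability path and learning to recover it given the audio — can be illustrated with a minimal, hypothetical sketch. This uses the simplest mask-based path common in discrete flow matching; the names (`corrupt`, `dfm_training_step`, `MASK_ID`) and the uniform masking schedule are illustrative assumptions, not the paper's audio-conditioned path or actual implementation.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 32        # hypothetical token vocabulary size (incl. mask token)
MASK_ID = VOCAB_SIZE - 1

def corrupt(tokens, t):
    """Mask each target token independently with probability (1 - t),
    a simple mask-based probability path from noise (t=0) to data (t=1).
    (Drax instead shapes this path with audio conditioning.)"""
    keep = torch.rand_like(tokens, dtype=torch.float) < t
    return torch.where(keep, tokens, torch.full_like(tokens, MASK_ID))

def dfm_training_step(model, audio_features, target_tokens):
    """One discrete-flow-matching-style training step: sample a time t,
    corrupt the transcript along the path, and train the model to
    predict the clean tokens given the corrupted sequence and audio."""
    t = torch.rand(())                        # random time in [0, 1]
    noisy = corrupt(target_tokens, t)
    logits = model(noisy, audio_features)     # (batch, seq, vocab)
    return F.cross_entropy(logits.transpose(1, 2), target_tokens)
```

At inference, such models iteratively unmask all positions in parallel, which is the source of the efficiency gains the abstract refers to.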
Related papers
- ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models [12.527207210862151]
ReHear is a framework for iterative pseudo-label refinement in automatic speech recognition. It integrates an instruction-tuned, audio-aware large language model into the self-training loop. We show that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
arXiv Detail & Related papers (2026-02-21T05:04:22Z) - From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model [72.73512218682187]
We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade.
arXiv Detail & Related papers (2025-10-22T06:58:55Z) - SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation [62.14510717860079]
We propose a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation.
arXiv Detail & Related papers (2025-10-07T17:29:28Z) - Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction [15.151965172049271]
We propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.
arXiv Detail & Related papers (2025-08-10T02:36:33Z) - Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application [11.385703484113552]
We propose a novel semantic communication framework empowered by generative artificial intelligence (GAI). A latent diffusion model (LDM)-based semantic communication framework is proposed that combines a variational autoencoder for semantic feature extraction. The proposed system is a training-free framework that supports zero-shot generalization and achieves superior performance under low-SNR and out-of-distribution conditions.
arXiv Detail & Related papers (2025-06-06T03:20:32Z) - Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition [13.189571090038674]
We present a new approach for streaming automatic speech recognition (ASR) using transducer neural networks.
In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules.
We show that FoCCE training improves the accuracy of the streaming transducers.
arXiv Detail & Related papers (2024-11-26T15:53:13Z) - DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval [49.076590578101985]
We present a diffusion-based ATR framework (DiffATR) that generates the joint distribution from noise.
Experiments on the AudioCaps and Clotho datasets show superior performance, verifying the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-16T06:33:26Z) - Observation-Guided Diffusion Probabilistic Models [41.749374023639156]
We propose a novel diffusion-based image generation method called the observation-guided diffusion probabilistic model (OGDM).
Our approach reestablishes the training objective by integrating the guidance of the observation process with the Markov chain.
We demonstrate the effectiveness of our training algorithm using diverse inference techniques on strong diffusion model baselines.
arXiv Detail & Related papers (2023-10-06T06:29:06Z) - A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation [59.64193903397301]
Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive baselines.
We conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR).
The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
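The speed advantage described above comes from deciding every output position in one parallel pass rather than token by token. A common NAR ASR recipe — CTC-style greedy decoding, one of the methods such comparative studies typically include — can be sketched as follows; this is a minimal illustration, not any specific system from the paper.

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Post-process a parallel per-frame argmax (CTC-style):
    merge consecutive repeats, then drop blank symbols.
    All frames are labeled at once, so decoding is O(T) with
    no sequential dependency between output positions."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

For example, the per-frame sequence `[0, 3, 3, 0, 3, 5, 5]` collapses to `[3, 3, 5]`: the repeated `3`s are merged within each run, the blank (`0`) separates the two genuine `3`s, and the trailing `5`s merge to one token.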
arXiv Detail & Related papers (2021-10-11T13:05:06Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement method (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.