Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
- URL: http://arxiv.org/abs/2602.12241v1
- Date: Thu, 12 Feb 2026 18:20:45 GMT
- Title: Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
- Authors: Manjunath Kudlur, Evan King, James Wang, Pete Warden
- Abstract summary: Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR). We introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster.
- Score: 0.8691520242484038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length, as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
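The contrast the abstract draws between full attention and sliding-window attention can be illustrated with a small sketch. This is not the Moonshine v2 implementation: the function names, the window sizes (`left`, `right`), and the layer count are hypothetical values chosen only for illustration. It builds a boolean attention mask, counts how many frame pairs each scheme touches, and computes how many future frames a stack of windowed layers has to wait for.

```python
# Illustrative sketch only: window sizes and layer counts are assumptions,
# not values taken from the Moonshine v2 paper.

def sliding_window_mask(num_frames, left, right):
    """mask[i][j] is True when frame i may attend to frame j,
    i.e. when j lies within `left` frames before or `right` frames after i."""
    return [
        [i - left <= j <= i + right for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


def full_attention_pairs(num_frames):
    """Full attention lets every frame attend to every frame: n * n pairs."""
    return num_frames * num_frames


def lookahead_frames(num_layers, right):
    """Future input frames needed before one output frame is final.
    Each stacked layer extends the dependency by `right` frames."""
    return num_layers * right


if __name__ == "__main__":
    # Pair counts grow quadratically for full attention and roughly
    # linearly for the windowed mask as the utterance gets longer.
    for n in (100, 200, 400):
        mask = sliding_window_mask(n, left=16, right=4)
        print(n, full_attention_pairs(n), attended_pairs(mask))
    # The wait for future context depends on depth and window, not on n.
    print(lookahead_frames(num_layers=6, right=4))
```

Under these assumptions the windowed mask costs about `n * (left + right + 1)` pairs rather than `n * n`, and the lookahead is fixed by the architecture rather than by utterance length, which is the sense in which windowed inference is bounded.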
Related papers
- Whisfusion: Parallel ASR Decoding via a Diffusion Transformer [7.327454599174306]
Whisfusion is a framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny, and offers comparable latency on short audio.
arXiv Detail & Related papers (2025-08-09T17:20:54Z)
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame-by-frame. Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
- Streaming parallel transducer beam search with fast-slow cascaded encoders [23.416682253435837]
Streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.
We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders.
arXiv Detail & Related papers (2022-03-29T17:29:39Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, WNARS achieves a character error rate of 5.22% with 640 ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling [5.080331097831114]
High quality text-to-speech (TTS) systems use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio.
While these models can produce high quality speech, they often incur an $O(L)$ increase in both latency and real-time factor (RTF) with respect to input length.
We propose a multi-rate architecture that breaks the latency bottlenecks by encoding a compact representation during streaming.
arXiv Detail & Related papers (2021-04-01T18:15:30Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)