Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting
- URL: http://arxiv.org/abs/2510.00982v1
- Date: Wed, 01 Oct 2025 14:56:45 GMT
- Title: Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting
- Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe,
- Abstract summary: A Transformer-based encoder has been widely used with block processing.<n>We propose a new encoder, Spiralformer, tailored for block processing by combining layer dropping and early exiting.<n> Experimentally, we observed that our method achieved 21.6% reduction in the averaged token emission delay in Librispeech.
- Score: 70.75913449565203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For streaming speech recognition, a Transformer-based encoder has been widely used with block processing. Although many studies addressed improving emission latency of transducers, little work has been explored for improving encoding latency of the block processing. We seek to reduce latency by frequently emitting a chunk with a small shift rather than scarce large-chunk emissions, resulting in higher computational costs. To efficiently compute with the small chunk shift, we propose a new encoder, Spiralformer, tailored for block processing by combining layer dropping and early exiting. We skip layer computation in a cyclic manner and shift the computed layer in each block spirally, which completes computation for all the layers over the block processing. Experimentally, we observed that our method achieved 21.6% reduction in the averaged token emission delay in Librispeech, and 7.0% in CSJ, compared with the baseline with similar computational cost and word error rates.
Related papers
- Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding [56.066799081747845]
The ever-growing size of neural networks poses serious challenges on resource-constrained devices.<n>We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding.<n>Our method allows for very fast decoding and is compatible with arbitrary quantization grids.
arXiv Detail & Related papers (2025-05-24T15:52:49Z) - Fast correlated decoding of transversal logical algorithms [67.01652927671279]
Quantum error correction (QEC) is required for large-scale computation, but incurs a significant resource overhead.<n>Recent advances have shown that by jointly decoding logical qubits in algorithms composed of logical gates, the number of syndrome extraction rounds can be reduced.<n>Here, we reform the problem of decoding circuits by directly decoding relevant logical operator products as they propagate through the circuit.
arXiv Detail & Related papers (2025-05-19T18:00:00Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers [21.91815582658188]
Large scale language models are delivering unprecedented performance on almost all natural language processing tasks.
The overwhelming complexity incurs a high inference latency that negatively affects user experience.
We propose to identify quasi-independent layers, which can be concurrently computed to significantly decrease inference latency.
arXiv Detail & Related papers (2024-04-10T03:30:01Z) - Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z) - Parallel window decoding enables scalable fault tolerant quantum
computation [2.624902795082451]
We present a methodology that parallelizes the decoding problem and achieves almost arbitrary syndrome processing speed.
Our parallelization requires some classical feedback decisions to be delayed, leading to a slow-down of the logical clock speed.
Using known auto-teleportation gadgets the slow-down can be eliminated altogether in exchange for increased qubit overhead.
arXiv Detail & Related papers (2022-09-18T12:37:57Z) - Streaming parallel transducer beam search with fast-slow cascaded
encoders [23.416682253435837]
Streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.
We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders.
arXiv Detail & Related papers (2022-03-29T17:29:39Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.