Early Attentive Sparsification Accelerates Neural Speech Transcription
- URL: http://arxiv.org/abs/2506.15912v1
- Date: Wed, 18 Jun 2025 23:16:02 GMT
- Title: Early Attentive Sparsification Accelerates Neural Speech Transcription
- Authors: Zifei Xu, Sayeh Sharify, Hesham Mostafa, Tristan Webb, Wanzin Yazar, Xin Wang
- Abstract summary: Transformer-based neural speech processing has achieved state-of-the-art performance. We seek to accelerate neural speech transcription by time-domain signal sparsification early in the neural encoding stage.
- Score: 6.074922505142795
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based neural speech processing has achieved state-of-the-art performance. Since speech audio signals are known to be highly compressible, here we seek to accelerate neural speech transcription by time-domain signal sparsification early in the neural encoding stage, taking advantage of the interpretability of the self-attention mechanism in transformer audio encoders. With the Whisper family of models, we perform a systematic architecture search over the joint space of sparsification stage (a certain encoder layer) and compression ratio (sparsity). We found that the best resulting solutions under 1% accuracy degradation choose to sparsify the hidden state to 40-60% sparsity at an early encoding stage, and thereby achieve up to 1.6x runtime acceleration in English speech transcription tasks on Nvidia GPUs without any fine-tuning.
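A minimal sketch of the attention-guided time-domain sparsification the abstract describes, applied to the hidden state of an early encoder layer. This is not the authors' released code: the choice of layer and the use of received attention mass as the frame-importance score are illustrative assumptions; the abstract only reports that an early stage and 40-60% sparsity work best.

```python
import torch

def sparsify_hidden_state(hidden, attn_weights, sparsity=0.5):
    """Drop the least-attended time frames from a transformer encoder hidden state.

    hidden:       [batch, time, dim]        activations at the chosen (early) layer
    attn_weights: [batch, heads, time, time] self-attention probabilities of that layer
    sparsity:     fraction of frames to remove (40-60% reported as the sweet spot)
    """
    batch, time, dim = hidden.shape
    keep = max(1, int(round(time * (1.0 - sparsity))))
    # Frame importance = attention it receives, averaged over heads and summed
    # over query positions (one plausible interpretability-based score).
    importance = attn_weights.mean(dim=1).sum(dim=1)                  # [batch, time]
    idx = importance.topk(keep, dim=-1).indices.sort(dim=-1).values  # keep temporal order
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, dim))   # [batch, keep, dim]
```

The runtime saving comes from running all remaining encoder layers (and the decoder's cross-attention) on the shortened sequence; the architecture search in the paper is then a sweep over the (sparsification layer, sparsity) pair.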
Related papers
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework.
Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for speech-driven generation.
We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer [9.914613096064848]
Decoding speech from electroencephalogram (EEG) signals is challenging due to a limited understanding of neural pronunciation mapping.
In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects.
Our results reveal distinct neural speech features through visualization of FAST-generated activation maps across frontal and temporal brain regions.
arXiv Detail & Related papers (2025-04-02T10:38:08Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting a spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
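The "repeatedly masks and predicts unit choices" step can be illustrated with a generic non-autoregressive mask-predict decoding loop. This is a sketch of the general conditional-masked-decoding technique rather than TranSpeech's actual implementation; `model`, `mask_id`, and the linear unmasking schedule are placeholders.

```python
import torch

def mask_predict_decode(model, length, mask_id, iterations=10):
    units = torch.full((1, length), mask_id, dtype=torch.long)   # start fully masked
    scores = torch.zeros(1, length)
    for t in range(iterations):
        logits = model(units)                                    # [1, length, vocab]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = units.eq(mask_id)
        units = torch.where(masked, pred, units)                 # fill masked slots
        scores = torch.where(masked, conf, scores)
        # Re-mask the least confident positions for the next iteration.
        n_mask = int(length * (1.0 - (t + 1) / iterations))
        if n_mask == 0:
            break
        remask = scores.topk(n_mask, largest=False).indices
        units = units.scatter(1, remask, mask_id)
    return units
```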
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulates those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
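A conceptual sketch of this target-switching idea, assuming hypothetical stand-ins for the wav2vec 2.0 encoder, quantizer, and contrastive loss; it is not the wav2vec-Switch implementation.

```python
def switched_contrastive_loss(encoder, quantizer, contrastive_loss, clean_wav, noisy_wav):
    # Contextualized representations of the original and noise-augmented views.
    ctx_clean, ctx_noisy = encoder(clean_wav), encoder(noisy_wav)
    # Quantized targets of each view.
    q_clean, q_noisy = quantizer(clean_wav), quantizer(noisy_wav)

    # Standard objective: each view predicts its own quantized targets.
    loss = contrastive_loss(ctx_clean, q_clean) + contrastive_loss(ctx_noisy, q_noisy)
    # Switched objective: swap the targets across the pair, pushing the clean
    # and noisy representations to be consistent with each other.
    loss = loss + contrastive_loss(ctx_clean, q_noisy) + contrastive_loss(ctx_noisy, q_clean)
    return loss
```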
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments of transients and stationary noises.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
- Super-Human Performance in Online Low-latency Recognition of Conversational Speech [18.637636841477]
We present results for a system that can achieve super-human performance at a word-based latency of only 1 second behind a speaker's speech.
The system uses multiple attention-based encoder-decoder networks integrated within a novel low-latency incremental inference approach.
arXiv Detail & Related papers (2020-10-07T14:41:32Z)