TrimTail: Low-Latency Streaming ASR with Simple but Effective
Spectrogram-Level Length Penalty
- URL: http://arxiv.org/abs/2211.00522v1
- Date: Tue, 1 Nov 2022 15:12:34 GMT
- Title: TrimTail: Low-Latency Streaming ASR with Simple but Effective
Spectrogram-Level Length Penalty
- Authors: Xingchen Song, Di Wu, Zhiyong Wu, Binbin Zhang, Yuekai Zhang, Zhendong
Peng, Wenpeng Li, Fuping Pan, Changbao Zhu
- Abstract summary: We present TrimTail, a simple but effective emission regularization method to improve the latency of streaming ASR models.
We achieve 100 $\sim$ 200ms latency reduction with equal or even better accuracy on both Aishell-1 and Librispeech.
- Score: 14.71509986713044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present TrimTail, a simple but effective emission
regularization method to improve the latency of streaming ASR models. The core
idea of TrimTail is to apply length penalty (i.e., by trimming trailing frames,
see Fig. 1-(b)) directly on the spectrogram of input utterances, which does not
require any alignment. We demonstrate that TrimTail is computationally cheap
and can be applied online and optimized with any training loss or any model
architecture on any dataset without any extra effort by applying it on various
end-to-end streaming ASR networks either trained with CTC loss [1] or
Transducer loss [2]. We achieve 100 $\sim$ 200ms latency reduction with equal
or even better accuracy on both Aishell-1 and Librispeech. Moreover, by using
TrimTail, we can achieve a 400ms algorithmic improvement of User Sensitive
Delay (USD) with an accuracy loss of less than 0.2.
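Below is a minimal sketch of the spectrogram-level length penalty described above: during training, a random number of trailing frames is trimmed from each padded utterance before it enters the encoder. The function name `trim_tail`, the uniform sampling of the trim length, and the `max_trim` default are illustrative assumptions; the paper's actual trimming schedule and hyperparameters are not reproduced here.

```python
import torch

def trim_tail(feats: torch.Tensor, feat_lens: torch.Tensor,
              max_trim: int = 20) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply a spectrogram-level length penalty by trimming trailing frames.

    feats:     (batch, time, num_mels) padded log-mel spectrogram batch
    feat_lens: (batch,) valid number of frames per utterance
    max_trim:  upper bound on the number of trailing frames to drop
               (hypothetical default; the real schedule is a tunable choice)
    """
    if max_trim <= 0:
        return feats, feat_lens
    # Sample an independent trim length per utterance.
    trims = torch.randint(0, max_trim + 1, (feats.size(0),), device=feats.device)
    # Never trim an utterance below one frame.
    new_lens = torch.clamp(feat_lens - trims, min=1)
    # Truncate the padded tensor to the new maximum length; frames beyond
    # new_lens are masked/ignored by the usual length handling downstream.
    feats = feats[:, : int(new_lens.max()), :]
    return feats, new_lens
```

Because the trimming acts on the input spectrogram before any model computation, no alignment is needed and such a function can be placed in front of an encoder trained with either CTC or Transducer loss, matching the claim in the abstract.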
Related papers
- CR-SAM: Curvature Regularized Sharpness-Aware Minimization [8.248964912483912]
Sharpness-Aware Minimization (SAM) aims to enhance generalization by minimizing the worst-case loss, using one-step gradient ascent as an approximation.
In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets.
In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM).
arXiv Detail & Related papers (2023-12-21T03:46:29Z)
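CR-SAM above regularizes vanilla SAM, whose core approximation is a single gradient-ascent step to find the worst-case weight perturbation. For reference, here is a minimal generic SAM update in PyTorch (not CR-SAM's curvature-regularized variant); the function name `sam_step` and the `rho=0.05` default are illustrative.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho: float = 0.05):
    """One SAM update: perturb weights along the normalized gradient
    (one-step ascent), compute the loss at the perturbed point, then
    apply the resulting gradient from the original weights."""
    inputs, targets = batch

    # 1) Gradient at the current weights w.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # 2) Ascend: epsilon = rho * g / ||g|| (applied in place, remembered for undo).
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)

    # 3) Gradient at the perturbed weights w + epsilon.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # 4) Undo the perturbation and step with the gradient from the perturbed point.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```

CR-SAM builds on this loop by adding its normalized-Hessian-trace curvature regularizer, as summarized in the entry above.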
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition [38.28868751443619]
We propose a new training method to explicitly model and reduce the latency of sequence transducer models.
Experimental results show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%.
arXiv Detail & Related papers (2022-11-04T09:19:59Z)
- Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach [132.37966970098645]
One popular solution is Sharpness-Aware Minimization (SAM), which minimizes the worst-case change of the training loss under a weight perturbation.
In this paper, we propose an efficient and effective training scheme coined Sparse SAM (SSAM), which sparsifies the perturbation instead of perturbing every weight.
In addition, we theoretically prove that SSAM can converge at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2022-10-11T06:30:10Z)
- Structured Directional Pruning via Perturbation Orthogonal Projection [13.704348351073147]
A more reasonable approach is to find a sparse minimizer along the flat minimum valley found by the optimizer.
We propose the structured directional pruning based on projecting the perturbations onto the flat minimum valley.
Experiments show that our method obtains the state-of-the-art pruned accuracy (i.e. 93.97% on VGG16, CIFAR-10 task) without retraining.
arXiv Detail & Related papers (2021-07-12T11:35:47Z)
- Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z)
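The EPruner entry above selects an adaptive number of exemplar filters with a message-passing algorithm. Below is a small sketch in the same spirit using scikit-learn's AffinityPropagation (itself a message-passing clustering method) on a convolution layer's flattened filters; the cosine-similarity affinity and the `exemplar_filter_indices` helper are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def exemplar_filter_indices(conv_weight: np.ndarray) -> np.ndarray:
    """Select exemplar filters from a conv weight tensor of shape
    (out_channels, in_channels, kH, kW) via affinity propagation.

    Affinity propagation chooses the number of exemplars adaptively,
    so no target pruning ratio has to be fixed in advance
    (cosine similarity is an assumed choice of affinity here)."""
    flat = conv_weight.reshape(conv_weight.shape[0], -1)
    # Cosine similarity between filters.
    norm = np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12
    unit = flat / norm
    similarity = unit @ unit.T
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    ap.fit(similarity)
    # Indices of the filters chosen as exemplars; the remaining filters
    # would be pruned (and the next layer's input channels trimmed to match).
    return np.sort(ap.cluster_centers_indices_)

# Example: a random 64-filter 3x3 conv layer.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)
    keep = exemplar_filter_indices(w)
    print(f"keeping {len(keep)} of {w.shape[0]} filters:", keep)
```

Because the clustering picks its own number of exemplars, the pruning ratio adapts per layer rather than being fixed in advance, and no training data is needed to rank "important" filters.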
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector.
Together they form a single-shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Though structurally simple, it achieves state-of-the-art results at a real-time speed of $20$ FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)