Lightweight Transducer Based on Frame-Level Criterion
- URL: http://arxiv.org/abs/2409.13698v2
- Date: Fri, 1 Nov 2024 06:08:08 GMT
- Title: Lightweight Transducer Based on Frame-Level Criterion
- Authors: Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye
- Abstract summary: We propose a lightweight transducer model based on frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame.
To address the problem of imbalanced classification caused by excessive blanks in the label, we decouple the blank and non-blank probabilities.
Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to the standard transducer.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A transducer model trained with a sequence-level criterion requires a large amount of memory because it generates a large probability lattice. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced-alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time step, rather than combining every element of the encoder output with every element of the decoder output as in the standard transducer. This significantly reduces memory and computation requirements. To address the class imbalance caused by excessive blanks in the labels, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to the transducer. Additionally, by using richer information to predict the blank probability, we achieve results superior to the transducer.
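The two core ideas in the abstract can be sketched in a few lines. This is a minimal illustrative NumPy sketch, not the paper's released code: all names (`frame_level_joint`, `decoupled_probs`, `frame_to_pos`) are hypothetical, and the per-frame alignment indices are assumed to come from a CTC forced aligner. It shows (1) combining each encoder frame only with the decoder state at its aligned position, yielding a (T, D) joint instead of the transducer's (T, U, D) lattice, and (2) decoupling the blank probability from the non-blank distribution.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_level_joint(enc_out, dec_out, frame_to_pos):
    # enc_out: (T, D); dec_out: (U+1, D); frame_to_pos: (T,) indices taken
    # from CTC forced alignment. The result is (T, D) rather than the
    # (T, U+1, D) lattice of a standard transducer, so memory scales with
    # T instead of T*U.
    return enc_out + dec_out[frame_to_pos]

def decoupled_probs(joint, w_blank, w_token):
    # Blank gets its own binary classifier; non-blank tokens share a
    # softmax scaled by (1 - p_blank). During training, the gradient of
    # the blank branch would be truncated before the main network (e.g.
    # tensor.detach() in PyTorch) to mitigate the blank class imbalance.
    p_blank = sigmoid(joint @ w_blank)           # (T, 1)
    p_token = softmax(joint @ w_token)           # (T, V)
    return np.concatenate([p_blank, (1.0 - p_blank) * p_token], axis=-1)
```

By construction each output row sums to one: p_blank + (1 - p_blank) * 1 = 1, so the decoupled head still yields a valid distribution over blank plus the V non-blank tokens.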
Related papers
- Threshold Selection for Iterative Decoding of $(v,w)$-regular Binary Codes [84.0257274213152]
Iterative bit flipping decoders are an efficient choice for sparse $(v,w)$-regular codes.
We propose concrete criteria for threshold determination, backed by a closed form model.
arXiv Detail & Related papers (2025-01-23T17:38:22Z)
- Cluster Decomposition for Improved Erasure Decoding of Quantum LDPC Codes [7.185960422285947]
We introduce a new erasure decoder that applies to arbitrary quantum LDPC codes.
By allowing clusters of unconstrained size, this decoder achieves maximum-likelihood (ML) performance.
For the general quantum LDPC codes we studied, the cluster decoder can be used to estimate the ML performance curve.
arXiv Detail & Related papers (2024-12-11T23:14:23Z)
- The Conformer Encoder May Reverse the Time Dimension [53.9351497436903]
We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to reverse the time dimension.
We propose methods and ideas of how this flipping can be avoided and investigate a novel method to obtain label-frame-position alignments.
arXiv Detail & Related papers (2024-10-01T13:39:05Z)
- Label-Looping: Highly Efficient Decoding for Transducers [19.091932566833265]
This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models.
Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding when using batch size 32.
arXiv Detail & Related papers (2024-06-10T12:34:38Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
arXiv Detail & Related papers (2021-04-07T03:15:10Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences of its use.