Tied & Reduced RNN-T Decoder
- URL: http://arxiv.org/abs/2109.07513v1
- Date: Wed, 15 Sep 2021 18:19:16 GMT
- Title: Tied & Reduced RNN-T Decoder
- Authors: Rami Botros (1), Tara N. Sainath (1), Robert David (1), Emmanuel Guzman (1), Wei Li (1), Yanzhang He (1) ((1) Google Inc., USA)
- Abstract summary: We study ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degradation in recognition performance.
Our prediction network performs a simple weighted averaging of the input embeddings, and shares its embedding matrix weights with the joint network's output layer.
This simple design, when used in conjunction with additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T decoder from 23M parameters to just 2M, without affecting word-error rate (WER).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have
shown that, under some conditions, it is possible to simplify its prediction
network with little or no loss in recognition accuracy (arXiv:2003.07705
[eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context
size of previous labels and/or using a simpler architecture for its layers
instead of LSTMs. The benefits of such changes include reduction in model size,
faster inference and power savings, which are all useful for on-device
applications.
In this work, we study ways to make the RNN-T decoder (prediction network +
joint network) smaller and faster without degradation in recognition
performance. Our prediction network performs a simple weighted averaging of the
input embeddings, and shares its embedding matrix weights with the joint
network's output layer (a.k.a. weight tying, commonly used in language modeling
arXiv:1611.01462 [cs.LG]). This simple design, when used in conjunction with
additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T
Decoder from 23M parameters to just 2M, without affecting word-error rate
(WER).
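To make the decoder design described in the abstract concrete, below is a minimal PyTorch sketch (not the authors' implementation). It illustrates the two ideas: a prediction network that is just a learned weighted average over the embeddings of the last few labels, and weight tying, where the joint network's output layer reuses the embedding matrix instead of storing a separate vocabulary-sized projection. The class name, the context size of 2, the softmax-normalized position weights, and all dimensions are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a tied & reduced RNN-T decoder (illustrative only, not the
# authors' code). Assumes PyTorch; all names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedReducedDecoder(nn.Module):
    """Prediction network = weighted average of the last `context_size`
    label embeddings; the joint network's output layer reuses the same
    embedding matrix (weight tying)."""

    def __init__(self, vocab_size, embed_dim, enc_dim, context_size=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One learnable scalar per context position: the "simple weighted
        # averaging" of input embeddings.
        self.pos_weights = nn.Parameter(torch.ones(context_size))
        self.enc_proj = nn.Linear(enc_dim, embed_dim)

    def predict(self, prev_labels):
        # prev_labels: (batch, context_size) tensor of the last emitted labels.
        emb = self.embedding(prev_labels)              # (B, C, D)
        w = F.softmax(self.pos_weights, dim=0)         # normalized position weights
        return (emb * w[None, :, None]).sum(dim=1)     # (B, D)

    def joint(self, enc_out, pred_out):
        # enc_out: (B, enc_dim) encoder frame; pred_out: (B, D) from predict().
        hidden = torch.tanh(self.enc_proj(enc_out) + pred_out)
        # Weight tying: logits come from the embedding matrix itself, so no
        # separate vocabulary-sized output projection is stored.
        return hidden @ self.embedding.weight.T        # (B, vocab_size)


# Usage sketch with made-up sizes.
dec = TiedReducedDecoder(vocab_size=4096, embed_dim=320, enc_dim=512)
prev = torch.randint(0, 4096, (8, 2))    # last two emitted labels per utterance
frame = torch.randn(8, 512)              # one encoder frame per utterance
logits = dec.joint(frame, dec.predict(prev))
print(logits.shape)                      # torch.Size([8, 4096])
```

In a layout like this, the decoder's parameters are essentially the embedding matrix, the context weights, and the encoder projection, which is how such a design can shrink the parameter count so sharply compared to an LSTM prediction network plus a separate output layer.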
Related papers
- Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - MST-compression: Compressing and Accelerating Binary Neural Networks
with Minimum Spanning Tree [21.15961593182111]
Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices.
However, as neural networks become wider/deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even on the binary version.
This paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs.
arXiv Detail & Related papers (2023-08-26T02:42:12Z) - Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights on-the-fly by a small amount proportional to the magnitude scale.
arXiv Detail & Related papers (2023-03-16T21:06:13Z) - a novel attention-based network for fast salient object detection [14.246237737452105]
In current salient object detection networks, the most popular approach uses a U-shaped structure.
We propose a new deep convolutional network architecture with three contributions.
Results demonstrate that the proposed method can compress the model to roughly 1/3 of its original size with almost no loss in accuracy.
arXiv Detail & Related papers (2021-12-20T12:30:20Z) - Compact representations of convolutional neural networks via weight
pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while remaining at least as competitive as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z) - FAT: Learning Low-Bitwidth Parametric Representation via Frequency-Aware
Transformation [31.546529106932205]
Frequency-Aware Transformation (FAT) learns to transform network weights in the frequency domain before quantization.
FAT can be easily trained in low precision using simple standard quantizers.
Code will be available soon.
arXiv Detail & Related papers (2021-02-15T10:35:20Z) - Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch.
arXiv Detail & Related papers (2021-02-08T05:55:47Z) - A Fully Tensorized Recurrent Neural Network [48.50376453324581]
We introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell.
This approach reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs.
arXiv Detail & Related papers (2020-10-08T18:24:12Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the performance decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding FPN networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z) - RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference [24.351577383531616]
We introduce RNNPool, a novel pooling operator based on Recurrent Neural Networks (RNNs).
An RNNPool layer can effectively replace multiple blocks in a variety of architectures such as MobileNets and DenseNet when applied to standard vision tasks like image classification and face detection.
We use RNNPool with the standard S3FD architecture to construct a face detection method that achieves state-of-the-art mAP on tiny ARM Cortex-M4 class microcontrollers with under 256 KB of RAM.
arXiv Detail & Related papers (2020-02-27T05:22:44Z)