Tied & Reduced RNN-T Decoder
- URL: http://arxiv.org/abs/2109.07513v1
- Date: Wed, 15 Sep 2021 18:19:16 GMT
- Title: Tied & Reduced RNN-T Decoder
- Authors: Rami Botros (1), Tara N. Sainath (1), Robert David (1), Emmanuel Guzman (1), Wei Li (1), Yanzhang He (1) ((1) Google Inc., USA)
- Abstract summary: We study ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degradation in recognition performance.
Our prediction network performs a simple weighted averaging of the input embeddings, and shares its embedding matrix weights with the joint network's output layer.
This simple design, when used in conjunction with additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T decoder from 23M parameters to just 2M, without affecting word-error rate (WER).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have
shown that, under some conditions, it is possible to simplify its prediction
network with little or no loss in recognition accuracy (arXiv:2003.07705
[eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context
size of previous labels and/or using a simpler architecture for its layers
instead of LSTMs. The benefits of such changes include reduction in model size,
faster inference and power savings, which are all useful for on-device
applications.
In this work, we study ways to make the RNN-T decoder (prediction network +
joint network) smaller and faster without degradation in recognition
performance. Our prediction network performs a simple weighted averaging of the
input embeddings, and shares its embedding matrix weights with the joint
network's output layer (a.k.a. weight tying, commonly used in language modeling
arXiv:1611.01462 [cs.LG]). This simple design, when used in conjunction with
additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T
Decoder from 23M parameters to just 2M, without affecting word-error rate
(WER).
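To make the decoder design described in the abstract concrete, below is a minimal PyTorch sketch (not the authors' implementation). It illustrates the two ideas: a prediction network that is just a learned weighted average over the embeddings of the last few labels, and weight tying, where the joint network's output layer reuses the embedding matrix instead of storing a separate vocabulary-sized projection. The class name, the context size of 2, the softmax-normalized position weights, and all dimensions are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a tied & reduced RNN-T decoder (illustrative only, not the
# authors' code). Assumes PyTorch; all names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedReducedDecoder(nn.Module):
    """Prediction network = weighted average of the last `context_size`
    label embeddings; the joint network's output layer reuses the same
    embedding matrix (weight tying)."""

    def __init__(self, vocab_size, embed_dim, enc_dim, context_size=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One learnable scalar per context position: the "simple weighted
        # averaging" of input embeddings.
        self.pos_weights = nn.Parameter(torch.ones(context_size))
        self.enc_proj = nn.Linear(enc_dim, embed_dim)

    def predict(self, prev_labels):
        # prev_labels: (batch, context_size) tensor of the last emitted labels.
        emb = self.embedding(prev_labels)              # (B, C, D)
        w = F.softmax(self.pos_weights, dim=0)         # normalized position weights
        return (emb * w[None, :, None]).sum(dim=1)     # (B, D)

    def joint(self, enc_out, pred_out):
        # enc_out: (B, enc_dim) encoder frame; pred_out: (B, D) from predict().
        hidden = torch.tanh(self.enc_proj(enc_out) + pred_out)
        # Weight tying: logits come from the embedding matrix itself, so no
        # separate vocabulary-sized output projection is stored.
        return hidden @ self.embedding.weight.T        # (B, vocab_size)


# Usage sketch with made-up sizes.
dec = TiedReducedDecoder(vocab_size=4096, embed_dim=320, enc_dim=512)
prev = torch.randint(0, 4096, (8, 2))    # last two emitted labels per utterance
frame = torch.randn(8, 512)              # one encoder frame per utterance
logits = dec.joint(frame, dec.predict(prev))
print(logits.shape)                      # torch.Size([8, 4096])
```

In a layout like this, the decoder's parameters are essentially the embedding matrix, the context weights, and the encoder projection, which is how such a design can shrink the parameter count so sharply compared to an LSTM prediction network plus a separate output layer.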
Related papers
- Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - MST-compression: Compressing and Accelerating Binary Neural Networks
with Minimum Spanning Tree [21.15961593182111]
Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices.
However, as neural networks become wider/deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even on the binary version.
This paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs.
arXiv Detail & Related papers (2023-08-26T02:42:12Z) - Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights on-the-fly by a small amount proportional to the magnitude scale.
arXiv Detail & Related papers (2023-03-16T21:06:13Z) - a novel attention-based network for fast salient object detection [14.246237737452105]
In current salient object detection networks, the most popular approach uses a U-shaped structure.
We propose a new deep convolutional network architecture with three contributions.
Results demonstrate that the proposed method can compress the model to roughly 1/3 of its original size with almost no loss in accuracy.
arXiv Detail & Related papers (2021-12-20T12:30:20Z) - Compact representations of convolutional neural networks via weight
pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while remaining at least as competitive as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z) - FAT: Learning Low-Bitwidth Parametric Representation via Frequency-Aware
Transformation [31.546529106932205]
Frequency-Aware Transformation (FAT) learns to transform network weights in the frequency domain before quantization.
FAT can be easily trained in low precision using simple standard quantizers.
Code will be available soon.
arXiv Detail & Related papers (2021-02-15T10:35:20Z) - Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch.
arXiv Detail & Related papers (2021-02-08T05:55:47Z) - A Fully Tensorized Recurrent Neural Network [48.50376453324581]
We introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell.
This approach reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs.
arXiv Detail & Related papers (2020-10-08T18:24:12Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the performance decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding FPN networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z) - RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference [24.351577383531616]
We introduce RNNPool, a novel pooling operator based on Recurrent Neural Networks (RNNs).
An RNNPool layer can effectively replace multiple blocks in a variety of architectures such as MobileNets and DenseNet when applied to standard vision tasks like image classification and face detection.
We use RNNPool with the standard S3FD architecture to construct a face detection method that achieves state-of-the-art mAP on tiny ARM Cortex-M4 class microcontrollers with under 256 KB of RAM.
arXiv Detail & Related papers (2020-02-27T05:22:44Z)