Memory-efficient Speech Recognition on Smart Devices
- URL: http://arxiv.org/abs/2102.11531v1
- Date: Tue, 23 Feb 2021 07:43:45 GMT
- Title: Memory-efficient Speech Recognition on Smart Devices
- Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan
Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra
- Abstract summary: Recurrent transducer models have emerged as a promising solution for speech recognition on smart devices.
These models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices.
We address the transducer models' memory access concerns by optimizing their model architecture and designing novel recurrent cells.
- Score: 15.015948023187809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent transducer models have emerged as a promising solution for speech recognition on current and next-generation smart devices. Transducer models provide competitive accuracy within a reasonable memory footprint, alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices.
We address the transducer models' memory access concerns by optimizing their model architecture and designing novel recurrent cells. We demonstrate that i) the model's energy cost is dominated by accessing model weights from off-chip memory, ii) the transducer model architecture is pivotal in determining the number of off-chip memory accesses, and model size alone is not a good proxy for them, and iii) our transducer model optimizations and novel recurrent cell reduce off-chip memory accesses by 4.5x and model size by 2x with minimal accuracy impact.
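To make the energy argument concrete, here is a hedged back-of-the-envelope sketch in Python (the layer sizes, frame rate, and per-access DRAM energy are illustrative assumptions, not figures from the paper): when every weight must be fetched from off-chip memory at every time step, total weight reads scale with parameters times time steps, so an architecture that runs its recurrent stack at a lower frame rate cuts energy even at the same model size.

```python
# Back-of-the-envelope estimate (illustrative assumptions, not figures
# from the paper): off-chip weight reads scale with parameters x steps.

DRAM_PJ_PER_32BIT = 640.0  # assumed DRAM access energy per 32-bit word

def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """Weight count of one LSTM cell: 4 gates over [input; hidden; bias]."""
    return 4 * (input_dim + hidden_dim + 1) * hidden_dim

def weight_read_energy_mj(params: int, steps: int) -> float:
    """Energy (mJ) if every weight is re-fetched from DRAM at each step."""
    return params * steps * DRAM_PJ_PER_32BIT * 1e-9  # pJ -> mJ

frames = 1000                               # ~10 s of audio at 100 frames/s
encoder_params = 5 * lstm_params(640, 640)  # hypothetical 5-layer encoder

print(f"model size:           {encoder_params:,} params")
print(f"energy @ every frame: {weight_read_energy_mj(encoder_params, frames):,.0f} mJ")
# Same model size, but the recurrent stack runs at half the frame rate
# (e.g., via time reduction), halving off-chip weight reads and energy:
print(f"energy @ half rate:   {weight_read_energy_mj(encoder_params, frames // 2):,.0f} mJ")
```

This is why model size alone is a poor proxy: two equally sized models can differ sharply in how often their weights cross the off-chip memory boundary.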
Related papers
- Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems [0.0]
Batteryless systems often face power failures, requiring extra runtime buffers to maintain progress and leaving only limited memory space for storing ultra-tiny deep neural networks (DNNs).
We propose FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems.
Our experiments showed that FreeML reduces model sizes by up to 95x, supports adaptive inference with 2.03-19.65x less memory overhead, and provides significant time and energy benefits with only a negligible accuracy drop compared to the state of the art.
arXiv Detail & Related papers (2024-05-16T20:16:45Z)
- MEMORYLLM: Towards Self-Updatable Large Language Models [101.3777486749529]
Existing Large Language Models (LLMs) usually remain static after deployment.
We introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool.
MEMORYLLM can self-update with text knowledge and memorize the knowledge injected earlier.
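As an illustration of the fixed-size memory-pool idea described above, here is a minimal Python sketch (the pool dimensions and the random-eviction update rule are our assumptions; MEMORYLLM's actual mechanism operates on latent vectors inside the transformer and may differ):

```python
import numpy as np

# Hedged sketch of a fixed-size memory pool (hypothetical dimensions and
# eviction policy; not MEMORYLLM's actual implementation).

class MemoryPool:
    def __init__(self, num_slots: int = 4096, dim: int = 256, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.slots = self.rng.standard_normal((num_slots, dim)).astype(np.float32)

    def self_update(self, new_memories: np.ndarray) -> None:
        """Inject new knowledge by overwriting randomly chosen old slots,
        so total memory stays constant while old knowledge decays slowly."""
        k = len(new_memories)
        evict = self.rng.choice(len(self.slots), size=k, replace=False)
        self.slots[evict] = new_memories

pool = MemoryPool()
pool.self_update(np.zeros((16, 256), dtype=np.float32))  # toy update
print(pool.slots.shape)  # pool size is unchanged: (4096, 256)
```

The point is that the pool never grows: injecting new knowledge overwrites a bounded number of slots, so earlier knowledge fades gradually instead of being wiped out.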
arXiv Detail & Related papers (2024-02-07T07:14:11Z)
- Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition [19.772585241974138]
Streaming speech recognition models usually process only a limited number of tokens at a time, so the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks.
We propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency.
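A quick parameter-versus-compute count makes the claim concrete (the dimensions below are hypothetical, not taken from the paper): with few tokens per step, the weights of the linear layers dwarf the attention-score arithmetic.

```python
# For a transformer layer with model dim d and FFN dim 4d, the QKV/output
# projections plus feedforward weights total ~12*d^2 parameters, all of
# which must be fetched each step, while attention scoring over t tokens
# costs only ~t^2*d multiply-accumulates. Dimensions are hypothetical.

d, t = 512, 32                 # model dim, tokens processed per step
proj_params = 4 * d * d        # Q, K, V, and output projections
ffn_params = 2 * d * (4 * d)   # two feedforward matrices
attn_macs = 2 * t * t * d      # QK^T scores + weighted sum of values

print(f"linear-layer weights fetched per step: {proj_params + ffn_params:,}")
print(f"attention MACs per step:               {attn_macs:,}")
```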
arXiv Detail & Related papers (2023-09-14T19:01:08Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that achieves maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models [47.99478573698432]
We consider methods to reduce the model size of Conformer-based speech recognition models.
Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors.
arXiv Detail & Related papers (2023-03-15T03:21:38Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- A High Throughput Generative Vector Autoregression Model for Stochastic Synapses [0.0]
We develop a high throughput generative model for synaptic arrays based on electrical measurement data for resistive memory cells.
We demonstrate array sizes above one billion cells and throughputs exceeding one hundred million weight updates per second, above the pixel rate of a 30 frames/s 4K video stream.
arXiv Detail & Related papers (2022-05-10T17:08:30Z)
- Improving the Efficiency of Transformers for Resource-Constrained Devices [1.3019517863608956]
We present a performance analysis of state-of-the-art vision transformers on several devices.
We show that by using only 64 clusters to represent model parameters, it is possible to reduce the data transfer from the main memory by more than 4x.
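A hedged sketch of the underlying arithmetic (our illustration; the paper's clustering procedure may differ): representing each weight as a 6-bit index into a 64-entry codebook cuts data transferred from main memory by more than 4x versus 32-bit floats.

```python
import numpy as np

# Illustrative weight clustering (not the paper's code): 64 clusters
# mean each weight is stored as a 6-bit index plus a shared codebook.

rng = np.random.default_rng(0)
weights = rng.standard_normal(100_000).astype(np.float32)

# Toy codebook from uniform quantiles (a stand-in for k-means):
codebook = np.quantile(weights, np.linspace(0, 1, 64)).astype(np.float32)
indices = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)

original_bits = weights.size * 32
clustered_bits = weights.size * 6 + codebook.size * 32  # indices + codebook
print(f"reduction in data transferred: {original_bits / clustered_bits:.1f}x")  # ~5.3x
```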
arXiv Detail & Related papers (2021-06-30T12:10:48Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- A Compact Gated-Synapse Model for Neuromorphic Circuits [77.50840163374757]
The model is developed in Verilog-A for easy integration into computer-aided design of neuromorphic circuits.
The behavioral theory of the model is described in detail along with a full list of the default parameter settings.
arXiv Detail & Related papers (2020-06-29T18:22:11Z)
- Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network [9.753369031264532]
Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models.
One of the major obstacles to achieving this goal is the memory limitation of mobile devices.
We propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory.
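A minimal sketch of the memory saving (the shapes and rank are hypothetical, and the paper learns the low-rank factors during training rather than computing an SVD): storing rank-r factors instead of a full m x n gradient matrix shrinks optimizer memory from m*n to r*(m+n) floats.

```python
import numpy as np

# Illustrative low-rank gradient approximation: keep rank-r factors
# U (m x r) and V (r x n) instead of the full m x n gradient matrix.

rng = np.random.default_rng(0)
m, n, r = 1024, 1024, 8
grad = rng.standard_normal((m, n)).astype(np.float32)  # full gradient

u, s, vt = np.linalg.svd(grad, full_matrices=False)
U = u[:, :r] * s[:r]   # fold singular values into the left factor
V = vt[:r, :]
grad_approx = U @ V    # reconstructed only when applying the update

full_floats = m * n
factor_floats = r * (m + n)
print(f"gradient memory saved: {full_floats / factor_floats:.0f}x")  # 64x
```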
arXiv Detail & Related papers (2020-01-24T05:12:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.