Memory-efficient Speech Recognition on Smart Devices
- URL: http://arxiv.org/abs/2102.11531v1
- Date: Tue, 23 Feb 2021 07:43:45 GMT
- Title: Memory-efficient Speech Recognition on Smart Devices
- Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan
Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra
- Abstract summary: Recurrent transducer models have emerged as a promising solution for speech recognition on smart devices.
These models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices.
We address the transducer models' memory access concerns by optimizing their model architecture and designing novel recurrent cells.
- Score: 15.015948023187809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent transducer models have emerged as a promising solution for speech
recognition on current and next-generation smart devices. The transducer
models provide competitive accuracy within a reasonable memory footprint,
alleviating the memory capacity constraints in these devices. However, these
models access parameters from off-chip memory for every input time step, which
adversely affects device battery life and limits their usability on low-power
devices.
We address the transducer models' memory access concerns by optimizing their
model architecture and designing novel recurrent cells. We demonstrate
that i) the model's energy cost is dominated by accessing model weights from
off-chip memory, ii) the transducer model architecture is pivotal in
determining the number of off-chip memory accesses, and model size alone is
not a good proxy, and iii) our transducer model optimizations and novel
recurrent cell reduce off-chip memory accesses by 4.5x and model size by 2x
with minimal accuracy impact.
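As a rough illustration of point ii), the sketch below compares total model size with the weight bytes a recurrent transducer pulls from off-chip memory per second of audio, assuming the encoder runs on every acoustic frame while the predictor and joiner run only when a token is emitted. All layer sizes, rates, and the one-fetch-per-evaluation assumption are illustrative, not figures from the paper.

```python
# Back-of-the-envelope estimate of off-chip weight traffic for a recurrent
# transducer. All sizes, rates, and the assumption that weights are re-read
# from off-chip memory on every evaluation are illustrative, not values
# reported in the paper.

BYTES_PER_WEIGHT = 1  # assume 8-bit quantized weights

def lstm_params(input_dim, hidden_dim):
    """Weight count of one LSTM layer (4 gates, biases ignored)."""
    return 4 * hidden_dim * (input_dim + hidden_dim)

encoder   = 5 * lstm_params(640, 640)     # evaluated on every 10 ms frame
predictor = 2 * lstm_params(256, 640)     # evaluated once per emitted token
joiner    = 640 * 640 + 640 * 4096        # projection + output vocabulary

frames_per_second = 100                   # 10 ms acoustic frames
tokens_per_second = 5                     # rough word-piece emission rate

traffic = (encoder * frames_per_second
           + (predictor + joiner) * tokens_per_second) * BYTES_PER_WEIGHT
size = (encoder + predictor + joiner) * BYTES_PER_WEIGHT

print(f"model size:       {size / 1e6:6.1f} MB")
print(f"off-chip traffic: {traffic / 1e6:6.1f} MB per second of audio")
```

Under these illustrative numbers the encoder dominates the traffic even though the joiner's output layer accounts for a large share of the total size, which is why model size alone is a poor proxy for off-chip memory accesses.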
Related papers
- Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models [0.755189019348525]
Transformer networks, driven by self-attention, are central to Large Language Models.
In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step.
We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells.
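The token-projection caching referred to above is the standard key/value cache used in autoregressive attention; below is a minimal NumPy sketch of that generic mechanism (not the gain-cell in-memory hardware the paper describes).

```python
import numpy as np

# Minimal single-head KV cache for autoregressive attention decoding.
# The projections of each new token are stored once and reused at every
# later step instead of being recomputed. Generic sketch, not the paper's
# analog in-memory implementation.

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
k_cache, v_cache = [], []   # grow by one entry per generated token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(x_t):
    """Process one new token embedding x_t, reusing cached K/V projections."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)   # project the new token once ...
    v_cache.append(x_t @ W_v)   # ... and never recompute it later
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(K @ q / np.sqrt(d))
    return scores @ V

for _ in range(8):              # toy generation loop
    out = attend(rng.standard_normal(d))
```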
arXiv Detail & Related papers (2024-09-28T11:00:11Z)
- Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices [19.96064012736243]
This paper introduces PIPELOAD, a memory-efficient pipeline execution mechanism.
It reduces memory usage by incorporating dynamic memory management and minimizes inference latency.
We present Hermes, a framework optimized for large model inference on edge devices.
arXiv Detail & Related papers (2024-09-06T12:55:49Z)
- Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems [0.0]
Batteryless systems often face power failures, requiring extra runtime buffers to maintain progress and leaving only limited memory space for storing ultra-tiny deep neural networks (DNNs).
We propose FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems.
Our experiments showed that FreeML reduces model sizes by up to $95\times$, supports adaptive inference with $2.03 - 19.65\times$ less memory overhead, and provides significant time and energy benefits with only a negligible accuracy drop compared to the state of the art.
arXiv Detail & Related papers (2024-05-16T20:16:45Z)
- MEMORYLLM: Towards Self-Updatable Large Language Models [101.3777486749529]
Existing Large Language Models (LLMs) usually remain static after deployment.
We introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool.
MEMORYLLM can self-update with text knowledge and memorize the knowledge injected earlier.
arXiv Detail & Related papers (2024-02-07T07:14:11Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models [47.99478573698432]
We consider methods to reduce the model size of Conformer-based speech recognition models.
Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors.
arXiv Detail & Related papers (2023-03-15T03:21:38Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Improving the Efficiency of Transformers for Resource-Constrained Devices [1.3019517863608956]
We present a performance analysis of state-of-the-art vision transformers on several devices.
We show that by using only 64 clusters to represent model parameters, it is possible to reduce data transfer from main memory by more than 4x.
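A minimal sketch of parameter clustering along these lines: the 64-cluster count matches the summary above, while the use of k-means to build the codebook is an illustrative assumption rather than the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster a weight matrix into 64 shared values (codebook quantization),
# so only small per-weight indices plus a tiny codebook need to be read
# from main memory. Illustrative sketch, not the paper's exact method.

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(W.reshape(-1, 1))
codebook = kmeans.cluster_centers_.astype(np.float32).ravel()   # 64 floats
indices = kmeans.labels_.astype(np.uint8).reshape(W.shape)      # 6-bit IDs stored in uint8

W_quant = codebook[indices]   # dequantize on the fly after loading the indices

original_bytes = W.size * 4                              # fp32 weights
compressed_bytes = indices.size * 1 + codebook.size * 4  # uint8 shown; 6-bit packing saves more
print(f"data transfer reduced by {original_bytes / compressed_bytes:.1f}x")
```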
arXiv Detail & Related papers (2021-06-30T12:10:48Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- A Compact Gated-Synapse Model for Neuromorphic Circuits [77.50840163374757]
The model is developed in Verilog-A for easy integration into computer-aided design of neuromorphic circuits.
The behavioral theory of the model is described in detail along with a full list of the default parameter settings.
arXiv Detail & Related papers (2020-06-29T18:22:11Z)
- Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network [9.753369031264532]
Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models.
One of the major obstacles to achieving this goal is the memory limitation of mobile devices.
We propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory.
arXiv Detail & Related papers (2020-01-24T05:12:18Z)
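For that last entry, a minimal sketch of why a low-rank representation of a layer's gradient can be far smaller than the dense matrix: the construction below factors a linear layer's batch gradient into its activations and back-propagated errors, a generic illustration rather than the paper's exact parameterization.

```python
import numpy as np

# For a linear layer y = W x, the batch gradient dL/dW = Delta^T X has rank
# at most the batch size, so storing the two factors instead of the dense
# matrix saves memory. Dimensions and batch size are illustrative.

rng = np.random.default_rng(0)
out_dim, in_dim, batch = 1024, 1024, 8

X = rng.standard_normal((batch, in_dim))       # layer inputs (activations)
Delta = rng.standard_normal((batch, out_dim))  # back-propagated errors

G_full = Delta.T @ X                           # dense gradient: out_dim x in_dim

dense_mem = G_full.size                        # floats for the dense gradient
factored_mem = X.size + Delta.size             # floats for the two thin factors

print(f"dense: {dense_mem} floats, factored: {factored_mem} floats "
      f"({dense_mem / factored_mem:.0f}x smaller)")
```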