Memory-efficient Speech Recognition on Smart Devices
- URL: http://arxiv.org/abs/2102.11531v1
- Date: Tue, 23 Feb 2021 07:43:45 GMT
- Title: Memory-efficient Speech Recognition on Smart Devices
- Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan
Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra
- Abstract summary: Recurrent transducer models have emerged as a promising solution for speech recognition on smart devices.
These models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices.
We address the transducer models' memory access concerns by optimizing their model architecture and designing novel recurrent cells.
- Score: 15.015948023187809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent transducer models have emerged as a promising solution for speech
recognition on current and next-generation smart devices. The transducer
models provide competitive accuracy within a reasonable memory footprint,
alleviating the memory capacity constraints in these devices. However, these
models access parameters from off-chip memory for every input time step, which
adversely affects device battery life and limits their usability on low-power
devices.
We address the transducer models' memory access concerns by optimizing their
model architecture and designing novel recurrent cells. We demonstrate
that i) the model's energy cost is dominated by accessing model weights from
off-chip memory, ii) the transducer model architecture is pivotal in
determining the number of off-chip memory accesses, and model size alone is
not a good proxy, and iii) our transducer model optimizations and novel
recurrent cell reduce off-chip memory accesses by 4.5x and model size by 2x
with minimal accuracy impact.
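As a rough illustration of point ii), the sketch below compares total model size with the weight bytes a recurrent transducer pulls from off-chip memory per second of audio, assuming the encoder runs on every acoustic frame while the predictor and joiner run only when a token is emitted. All layer sizes, rates, and the one-fetch-per-evaluation assumption are illustrative, not figures from the paper.

```python
# Back-of-the-envelope estimate of off-chip weight traffic for a recurrent
# transducer. All sizes, rates, and the assumption that weights are re-read
# from off-chip memory on every evaluation are illustrative, not values
# reported in the paper.

BYTES_PER_WEIGHT = 1  # assume 8-bit quantized weights

def lstm_params(input_dim, hidden_dim):
    """Weight count of one LSTM layer (4 gates, biases ignored)."""
    return 4 * hidden_dim * (input_dim + hidden_dim)

encoder   = 5 * lstm_params(640, 640)     # evaluated on every 10 ms frame
predictor = 2 * lstm_params(256, 640)     # evaluated once per emitted token
joiner    = 640 * 640 + 640 * 4096        # projection + output vocabulary

frames_per_second = 100                   # 10 ms acoustic frames
tokens_per_second = 5                     # rough word-piece emission rate

traffic = (encoder * frames_per_second
           + (predictor + joiner) * tokens_per_second) * BYTES_PER_WEIGHT
size = (encoder + predictor + joiner) * BYTES_PER_WEIGHT

print(f"model size:       {size / 1e6:6.1f} MB")
print(f"off-chip traffic: {traffic / 1e6:6.1f} MB per second of audio")
```

Under these illustrative numbers the encoder dominates the traffic even though the joiner's output layer accounts for a large share of the total size, which is why model size alone is a poor proxy for off-chip memory accesses.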
Related papers
- Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models [0.755189019348525]
Transformer networks, driven by self-attention, are central to Large Language Models.
In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step.
We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells.
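The token-projection caching referred to above is the standard key/value cache used in autoregressive attention; below is a minimal NumPy sketch of that generic mechanism (not the gain-cell in-memory hardware the paper describes).

```python
import numpy as np

# Minimal single-head KV cache for autoregressive attention decoding.
# The projections of each new token are stored once and reused at every
# later step instead of being recomputed. Generic sketch, not the paper's
# analog in-memory implementation.

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
k_cache, v_cache = [], []   # grow by one entry per generated token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(x_t):
    """Process one new token embedding x_t, reusing cached K/V projections."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)   # project the new token once ...
    v_cache.append(x_t @ W_v)   # ... and never recompute it later
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(K @ q / np.sqrt(d))
    return scores @ V

for _ in range(8):              # toy generation loop
    out = attend(rng.standard_normal(d))
```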
arXiv Detail & Related papers (2024-09-28T11:00:11Z)
- Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices [19.96064012736243]
This paper introduces PIPELOAD, a memory-efficient pipeline execution mechanism.
It reduces memory usage by incorporating dynamic memory management and minimizes inference latency.
We present Hermes, a framework optimized for large model inference on edge devices.
arXiv Detail & Related papers (2024-09-06T12:55:49Z)
- Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems [0.0]
Batteryless systems often face power failures, requiring extra runtime buffers to maintain progress and leaving only limited memory space for storing ultra-tiny deep neural networks (DNNs).
We propose FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems.
Our experiments showed that FreeML reduces model sizes by up to $95\times$, supports adaptive inference with $2.03 - 19.65\times$ less memory overhead, and provides significant time and energy benefits with only a negligible accuracy drop compared to the state of the art.
arXiv Detail & Related papers (2024-05-16T20:16:45Z)
- MEMORYLLM: Towards Self-Updatable Large Language Models [101.3777486749529]
Existing Large Language Models (LLMs) usually remain static after deployment.
We introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool.
MEMORYLLM can self-update with text knowledge and memorize the knowledge injected earlier.
arXiv Detail & Related papers (2024-02-07T07:14:11Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models [47.99478573698432]
We consider methods to reduce the model size of Conformer-based speech recognition models.
Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors.
arXiv Detail & Related papers (2023-03-15T03:21:38Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Improving the Efficiency of Transformers for Resource-Constrained Devices [1.3019517863608956]
We present a performance analysis of state-of-the-art vision transformers on several devices.
We show that by using only 64 clusters to represent model parameters, it is possible to reduce data transfer from main memory by more than 4x.
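A minimal sketch of parameter clustering along these lines: the 64-cluster count matches the summary above, while the use of k-means to build the codebook is an illustrative assumption rather than the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster a weight matrix into 64 shared values (codebook quantization),
# so only small per-weight indices plus a tiny codebook need to be read
# from main memory. Illustrative sketch, not the paper's exact method.

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(W.reshape(-1, 1))
codebook = kmeans.cluster_centers_.astype(np.float32).ravel()   # 64 floats
indices = kmeans.labels_.astype(np.uint8).reshape(W.shape)      # 6-bit IDs stored in uint8

W_quant = codebook[indices]   # dequantize on the fly after loading the indices

original_bytes = W.size * 4                              # fp32 weights
compressed_bytes = indices.size * 1 + codebook.size * 4  # uint8 shown; 6-bit packing saves more
print(f"data transfer reduced by {original_bytes / compressed_bytes:.1f}x")
```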
arXiv Detail & Related papers (2021-06-30T12:10:48Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- A Compact Gated-Synapse Model for Neuromorphic Circuits [77.50840163374757]
The model is developed in Verilog-A for easy integration into computer-aided design of neuromorphic circuits.
The behavioral theory of the model is described in detail along with a full list of the default parameter settings.
arXiv Detail & Related papers (2020-06-29T18:22:11Z)
- Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network [9.753369031264532]
Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models.
One of the major obstacles to achieving this goal is the memory limitation of mobile devices.
We propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory.
arXiv Detail & Related papers (2020-01-24T05:12:18Z)
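For that last entry, a minimal sketch of why a low-rank representation of a layer's gradient can be far smaller than the dense matrix: the construction below factors a linear layer's batch gradient into its activations and back-propagated errors, a generic illustration rather than the paper's exact parameterization.

```python
import numpy as np

# For a linear layer y = W x, the batch gradient dL/dW = Delta^T X has rank
# at most the batch size, so storing the two factors instead of the dense
# matrix saves memory. Dimensions and batch size are illustrative.

rng = np.random.default_rng(0)
out_dim, in_dim, batch = 1024, 1024, 8

X = rng.standard_normal((batch, in_dim))       # layer inputs (activations)
Delta = rng.standard_normal((batch, out_dim))  # back-propagated errors

G_full = Delta.T @ X                           # dense gradient: out_dim x in_dim

dense_mem = G_full.size                        # floats for the dense gradient
factored_mem = X.size + Delta.size             # floats for the two thin factors

print(f"dense: {dense_mem} floats, factored: {factored_mem} floats "
      f"({dense_mem / factored_mem:.0f}x smaller)")
```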