Memory-efficient Speech Recognition on Smart Devices
- URL: http://arxiv.org/abs/2102.11531v1
- Date: Tue, 23 Feb 2021 07:43:45 GMT
- Title: Memory-efficient Speech Recognition on Smart Devices
- Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan
Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra
- Abstract summary: Recurrent transducer models have emerged as a promising solution for speech recognition on smart devices.
These models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices.
We address transducer models' memory access concerns by optimizing their model architecture and designing novel recurrent cells.
- Score: 15.015948023187809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent transducer models have emerged as a promising solution for speech
recognition on the current and next generation smart devices. The transducer
models provide competitive accuracy within a reasonable memory footprint
alleviating the memory capacity constraints in these devices. However, these
models access parameters from off-chip memory for every input time step, which
adversely affects device battery life and limits their usability on low-power
devices.
We address transducer models' memory access concerns by optimizing their
model architecture and designing novel recurrent cells. We demonstrate
that i) the model's energy cost is dominated by accessing model weights from
off-chip memory, ii) the transducer model architecture is pivotal in determining
the number of accesses to off-chip memory, and model size alone is not a good
proxy, iii) our transducer model optimizations and novel recurrent cell reduce
off-chip memory accesses by 4.5x and model size by 2x with minimal accuracy
impact.
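As a rough illustration of points i) and ii), the sketch below (not from the paper; the per-byte energy costs, model sizes, and on-chip fraction are made-up assumptions) tallies weight-fetch energy per input frame, showing how two models of identical size can differ sharply in energy once their architectures touch different amounts of off-chip memory per step.

```python
# Illustrative only: energy figures and model sizes below are assumptions,
# not values from the paper.
DRAM_PJ_PER_BYTE = 100.0   # assumed off-chip (DRAM) read energy, pJ/byte
SRAM_PJ_PER_BYTE = 1.0     # assumed on-chip (SRAM) read energy, pJ/byte

def weight_fetch_energy_uj(bytes_touched_per_step, on_chip_fraction):
    """Energy (microjoules) spent fetching weights for one input time step."""
    off_chip = bytes_touched_per_step * (1.0 - on_chip_fraction)
    on_chip = bytes_touched_per_step * on_chip_fraction
    return (off_chip * DRAM_PJ_PER_BYTE + on_chip * SRAM_PJ_PER_BYTE) * 1e-6

# Same 30 MB model size, different architectures: model A re-reads every weight
# each frame, model B only touches a third of its weights per frame.
model_a = weight_fetch_energy_uj(30e6, on_chip_fraction=0.1)
model_b = weight_fetch_energy_uj(30e6 / 3, on_chip_fraction=0.1)
print(f"A: {model_a:.0f} uJ/step  B: {model_b:.0f} uJ/step")
```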
Related papers
- CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention [53.539020807256904]
We introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO).
Our tokenization scheme represents EEG signals at a per-channel patch granularity.
We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving a 2x speed improvement with 6x less memory than standard self-attention.
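A minimal sketch of what such an alternating pattern could look like, assuming single-head attention without learned projections and made-up tensor shapes; this is an illustration, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.
    x: (..., tokens, dim); queries, keys and values are x itself here."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., tokens, tokens)
    return softmax(scores, axis=-1) @ x

channels, time_steps, dim = 8, 16, 32
patches = np.random.randn(channels, time_steps, dim)

# Intra-channel temporal attention: tokens are time steps, batched over channels.
temporal = self_attention(patches)                     # (channels, time, dim)

# Inter-channel spatial attention: tokens are channels, batched over time steps.
spatial = self_attention(np.swapaxes(temporal, 0, 1))  # (time, channels, dim)
out = np.swapaxes(spatial, 0, 1)                       # back to (channels, time, dim)
print(out.shape)
```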
arXiv Detail & Related papers (2025-01-18T21:44:38Z) - Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices [7.229732269884237]
This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices.
The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size.
The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring.
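As a hedged illustration of one of the compression techniques mentioned, the snippet below shows generic symmetric per-tensor int8 post-training quantization, not the paper's specific TinyML pipeline:

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 values plus a single scale factor."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"4x smaller ({w.nbytes} -> {q.nbytes} bytes), mean abs error {err:.4f}")
```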
arXiv Detail & Related papers (2024-12-12T13:59:21Z) - Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models [0.755189019348525]
Transformer networks, driven by self-attention, are central to Large Language Models.
In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step.
We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells.
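A minimal sketch of the cache being referred to: each decoding step stores the new token's key/value projections so that past tokens are never re-projected. The random projection matrices and single-head setup are assumptions for illustration; the gain-cell hardware itself is outside the scope of a software snippet.

```python
import numpy as np

d_model = 64
Wk = np.random.randn(d_model, d_model)
Wv = np.random.randn(d_model, d_model)

k_cache, v_cache = [], []

def step(new_token_embedding):
    """Append this step's key/value to the cache and return the full cache.
    Without the cache, every past token would be re-projected at each step,
    which is exactly the recomputation (and memory traffic) being avoided."""
    k_cache.append(new_token_embedding @ Wk)
    v_cache.append(new_token_embedding @ Wv)
    return np.stack(k_cache), np.stack(v_cache)

for _ in range(5):                          # five decoding steps
    keys, values = step(np.random.randn(d_model))
print(keys.shape, values.shape)             # (5, 64) (5, 64)
```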
arXiv Detail & Related papers (2024-09-28T11:00:11Z) - Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices [19.96064012736243]
This paper introduces PIPELOAD, a memory-efficient pipeline execution mechanism.
It reduces memory usage by incorporating dynamic memory management and minimizes inference latency.
We present Hermes, a framework optimized for large model inference on edge devices.
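A rough sketch of the general pattern of pipelined, memory-frugal inference: prefetch the next layer's weights while the current layer computes, and keep only a small working set resident. The layer shapes, loader stub, and threading scheme here are assumptions, not PIPELOAD's actual mechanism.

```python
import threading
import numpy as np

def load_weights(layer_id, dim=256):
    # Stand-in for reading one layer's weights from flash or disk.
    return np.random.randn(dim, dim)

def run_pipeline(num_layers, x):
    staged = {"w": load_weights(0)}
    for i in range(num_layers):
        w = staged["w"]
        prefetch = None
        if i + 1 < num_layers:
            # Overlap the next layer's weight load with this layer's compute.
            prefetch = threading.Thread(
                target=lambda j=i + 1: staged.update(w=load_weights(j)))
            prefetch.start()
        x = np.tanh(x @ w)      # compute the current layer
        del w                   # this layer's weights are no longer needed
        if prefetch is not None:
            prefetch.join()
    return x

print(run_pipeline(4, np.random.randn(256)).shape)
```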
arXiv Detail & Related papers (2024-09-06T12:55:49Z) - Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems [0.0]
Batteryless systems often face power failures, requiring extra runtime buffers to maintain progress and leaving only limited memory space for storing ultra-tiny deep neural networks (DNNs).
We propose FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems.
Our experiments showed that FreeML reduces model sizes by up to 95x, supports adaptive inference with 2.03-19.65x less memory overhead, and provides significant time and energy benefits with only a negligible accuracy drop compared to the state of the art.
arXiv Detail & Related papers (2024-05-16T20:16:45Z) - MEMORYLLM: Towards Self-Updatable Large Language Models [101.3777486749529]
Existing Large Language Models (LLMs) usually remain static after deployment.
We introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool.
MEMORYLLM can self-update with text knowledge and memorize the knowledge injected earlier.
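A loose sketch of a fixed-size memory pool that absorbs new knowledge by overwriting a random subset of slots; the slot count, replacement rate, and encoder stub are illustrative assumptions rather than MEMORYLLM's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SLOTS, DIM = 1024, 64
memory_pool = rng.standard_normal((NUM_SLOTS, DIM))   # fixed size, never grows

def encode(text, dim=DIM):
    # Stand-in for the transformer's encoding of the injected text.
    return rng.standard_normal((len(text.split()), dim))

def self_update(text, replace_fraction=0.05):
    """Write new knowledge into the pool without growing it: a small random
    subset of slots is overwritten, so older memories decay gradually."""
    new = encode(text)
    n = max(1, int(replace_fraction * NUM_SLOTS))
    idx = rng.choice(NUM_SLOTS, size=n, replace=False)
    memory_pool[idx] = new[rng.integers(0, len(new), size=n)]

self_update("The model was updated with this sentence.")
print(memory_pool.shape)   # still (1024, 64): constant memory footprint
```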
arXiv Detail & Related papers (2024-02-07T07:14:11Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech
Recognition Models [47.99478573698432]
We consider methods to reduce the model size of Conformer-based speech recognition models.
Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors.
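A sketch of the parameter savings such sharing could give, assuming a single low-rank factor shared across layers and an illustrative rank; the actual factorization and sharing pattern in the paper may differ.

```python
import numpy as np

d, rank, num_layers = 512, 64, 12

dense_params = num_layers * d * d                        # one full matrix per layer
shared_U = np.random.randn(d, rank)                      # factor shared by all layers
per_layer_V = [np.random.randn(rank, d) for _ in range(num_layers)]
low_rank_params = shared_U.size + sum(v.size for v in per_layer_V)

def project(x, layer):
    # Low-rank replacement for the dense projection x @ W in a given layer.
    return x @ shared_U @ per_layer_V[layer]

print(f"dense: {dense_params:,} params, shared low-rank: {low_rank_params:,} "
      f"({dense_params / low_rank_params:.1f}x smaller)")
print(project(np.random.randn(d), 0).shape)
```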
arXiv Detail & Related papers (2023-03-15T03:21:38Z) - Improving the Efficiency of Transformers for Resource-Constrained
Devices [1.3019517863608956]
We present a performance analysis of state-of-the-art vision transformers on several devices.
We show that by using only 64 clusters to represent model parameters, it is possible to reduce the data transfer from the main memory by more than 4x.
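A sketch of the kind of clustering this implies: each float32 weight is replaced by a 6-bit index into a 64-entry codebook, so far fewer bytes cross the main-memory interface. The quantile-based codebook below stands in for whatever clustering the paper actually uses.

```python
import numpy as np

w = np.random.randn(100_000).astype(np.float32)

# Build a 64-entry codebook (assumption: quantile centroids instead of k-means).
codebook = np.quantile(w, np.linspace(0, 1, 64)).astype(np.float32)
indices = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

reconstructed = codebook[indices]
orig_bits = w.nbytes * 8                                  # 32 bits per weight
packed_bits = indices.size * 6 + codebook.nbytes * 8      # 6-bit indices + codebook
print(f"{orig_bits / packed_bits:.1f}x less data to transfer, "
      f"mean abs error {np.abs(w - reconstructed).mean():.4f}")
```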
arXiv Detail & Related papers (2021-06-30T12:10:48Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z) - A Compact Gated-Synapse Model for Neuromorphic Circuits [77.50840163374757]
The model is developed in Verilog-A for easy integration into computer-aided design of neuromorphic circuits.
The behavioral theory of the model is described in detail along with a full list of the default parameter settings.
arXiv Detail & Related papers (2020-06-29T18:22:11Z)