A TinyML Platform for On-Device Continual Learning with Quantized Latent
Replays
- URL: http://arxiv.org/abs/2110.10486v1
- Date: Wed, 20 Oct 2021 11:01:23 GMT
- Title: A TinyML Platform for On-Device Continual Learning with Quantized Latent
Replays
- Authors: Leonardo Ravaglia, Manuele Rusci, Davide Nadalini, Alessandro
Capotondi, Francesco Conti, Luca Benini
- Abstract summary: Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
- Score: 66.62377866022221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the last few years, research and development on Deep Learning models and
techniques for ultra-low-power devices, in a word TinyML, has mainly focused on
a train-then-deploy assumption, with static models that cannot be adapted to
newly collected data without cloud-based data collection and fine-tuning.
Latent Replay-based Continual Learning (CL) techniques [1] enable online,
serverless adaptation in principle, but so far they have still been too
computation- and memory-hungry for ultra-low-power TinyML devices, which are
typically based on microcontrollers. In this work, we introduce a HW/SW
platform for end-to-end CL based on a 10-core FP32-enabled parallel
ultra-low-power (PULP) processor. We rethink the baseline Latent Replay CL
algorithm, leveraging quantization of the frozen stage of the model and Latent
Replays (LRs) to reduce their memory cost with minimal impact on accuracy. In
particular, 8-bit compression of the LR memory proves to be almost lossless
(-0.26% with 3000 LRs) compared to the full-precision baseline implementation,
but requires 4x less memory, while 7-bit can also be used with an additional
minimal accuracy degradation (up to 5%). We also introduce optimized primitives
for forward and backward propagation on the PULP processor. Our results show
that by combining these techniques, continual learning can be achieved in
practice using less than 64MB of memory, an amount compatible with embedding in
TinyML devices. On an advanced 22nm prototype of our platform, called VEGA, the
proposed solution performs on average 65x faster than a low-power STM32 L4
microcontroller, being 37x more energy efficient, enough for a lifetime of 535h
when learning a new mini-batch of data once every minute.
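The core memory saving comes from keeping the frozen stage's activations (the Latent Replays) in 8-bit rather than FP32. Below is a minimal sketch of that idea, not the authors' implementation: it assumes a simple symmetric per-tensor quantization scheme, and the class and method names (LatentReplayBuffer, store, sample) and the 256-element feature size are illustrative only.

```python
import numpy as np


class LatentReplayBuffer:
    """Stores frozen-stage activations as int8 to cut LR memory ~4x vs. FP32."""

    def __init__(self, capacity, feature_shape, bits=8):
        self.qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 63 for 7-bit
        self.data = np.zeros((capacity,) + feature_shape, dtype=np.int8)
        self.scales = np.zeros(capacity, dtype=np.float32)
        self.size = 0
        self.capacity = capacity

    def store(self, activation):
        """Quantize one FP32 latent activation and keep it for later replay."""
        scale = np.abs(activation).max() / self.qmax + 1e-12
        q = np.clip(np.round(activation / scale), -self.qmax - 1, self.qmax)
        idx = self.size % self.capacity          # overwrite the oldest entry when full
        self.data[idx] = q.astype(np.int8)
        self.scales[idx] = scale
        self.size += 1

    def sample(self, batch_size, rng=None):
        """Dequantize a random mini-batch of LRs to feed the trainable stage."""
        if rng is None:
            rng = np.random.default_rng()
        n = min(self.size, self.capacity)
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        return self.data[idx].astype(np.float32) * self.scales[idx, None]


# Rough arithmetic behind the "4x less memory" claim: 3000 LRs with an
# illustrative 256-element feature map, stored in FP32 vs. int8.
buf = LatentReplayBuffer(capacity=3000, feature_shape=(256,))
buf.store(np.random.randn(256).astype(np.float32))
replay = buf.sample(batch_size=1)
print("FP32:", 3000 * 256 * 4 / 1e6, "MB   int8:", 3000 * 256 * 1 / 1e6, "MB")
```

The bits parameter only sketches the 7-bit variant in terms of quantization range; realizing the extra memory benefit of sub-8-bit LRs would additionally require bit-packing, which is omitted here for clarity.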
Related papers
- Optimizing TinyML: The Impact of Reduced Data Acquisition Rates for Time Series Classification on Microcontrollers [6.9604565273682955]
This paper investigates how reducing data acquisition rates affects TinyML models for time series classification.
By lowering the data sampling frequency, we aim to reduce computational demands, RAM usage, energy consumption, latency, and MAC operations by approximately fourfold.
arXiv Detail & Related papers (2024-09-17T07:21:49Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - TinyReptile: TinyML with Federated Meta-Learning [9.618821589196624]
We propose TinyReptile, a simple but efficient algorithm inspired by meta-learning and online learning.
We demonstrate TinyReptile on Raspberry Pi 4 and Cortex-M4 MCU with only 256-KB RAM.
arXiv Detail & Related papers (2023-04-11T13:11:10Z) - Tiny Classifier Circuits: Evolving Accelerators for Tabular Data [0.8936201690845327]
"Tiny" circuits are so tiny (i.e. consisting of no more than 300 logic gates) that they are called "Tiny" circuits.
This paper proposes a methodology for automatically predicting circuits for classification of data with comparable prediction to conventional machine learning.
arXiv Detail & Related papers (2023-02-28T19:13:39Z) - On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z) - TinyML Platforms Benchmarking [0.0]
Recent advances in ultra-low power embedded devices for machine learning (ML) have permitted a new class of products.
TinyML provides a unique solution by aggregating and analyzing data at the edge on low-power embedded devices.
Many TinyML frameworks have been developed for different platforms to facilitate the deployment of ML models.
arXiv Detail & Related papers (2021-11-30T15:26:26Z) - BSC: Block-based Stochastic Computing to Enable Accurate and Efficient
TinyML [10.294484356351152]
Machine learning (ML) has been successfully applied to edge applications, such as smart phones and automated driving.
Today, more applications require ML on tiny devices with extremely limited resources, such as the implantable cardioverter defibrillator (ICD); this is known as TinyML.
Unlike ML on the edge, TinyML with a limited energy supply has higher demands on low-power execution.
arXiv Detail & Related papers (2021-11-12T12:28:05Z) - TinyTL: Reduce Activations, Not Trainable Parameters for Efficient
On-Device Learning [78.80707950262214]
On-device learning enables edge devices to continually adapt the AI models to new data.
Existing work solves this problem by reducing the number of trainable parameters.
We present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning.
arXiv Detail & Related papers (2020-07-22T18:39:53Z)