Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
- URL: http://arxiv.org/abs/2505.24852v2
- Date: Tue, 01 Jul 2025 13:51:01 GMT
- Title: Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
- Authors: Douwe den Blanken, Charlotte Frenkel
- Abstract summary: On-device learning at the edge enables low-latency, private personalization with improved long-term robustness and reduced maintenance costs. Achieving scalable, low-power, end-to-end on-chip learning, especially from real-world sequential data with a limited number of examples, is an open challenge. We present Chameleon, leveraging three key contributions to solve these challenges.
- Score: 0.15178034047411867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-device learning at the edge enables low-latency, private personalization with improved long-term robustness and reduced maintenance costs. Yet, achieving scalable, low-power end-to-end on-chip learning, especially from real-world sequential data with a limited number of examples, is an open challenge. Indeed, accelerators supporting error backpropagation optimize for learning performance at the expense of inference efficiency, while simplified learning algorithms often fail to reach acceptable accuracy targets. In this work, we present Chameleon, leveraging three key contributions to solve these challenges. (i) A unified learning and inference architecture supports few-shot learning (FSL), continual learning (CL) and inference at only 0.5% area overhead to the inference logic. (ii) Long temporal dependencies are efficiently captured with temporal convolutional networks (TCNs), enabling the first demonstration of end-to-end on-chip FSL and CL on sequential data and inference on 16-kHz raw audio. (iii) A dual-mode, matrix-multiplication-free compute array allows either matching the power consumption of state-of-the-art inference-only keyword spotting (KWS) accelerators or enabling $4.3\times$ higher peak GOPS. Fabricated in 40-nm CMOS, Chameleon sets new accuracy records on Omniglot for end-to-end on-chip FSL (96.8%, 5-way 1-shot, 98.8%, 5-way 5-shot) and CL (82.2% final accuracy for learning 250 classes with 10 shots), while maintaining an inference accuracy of 93.3% on the 12-class Google Speech Commands dataset at an extreme-edge power budget of 3.1 $\mu$W.
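To make the abstract's two central ingredients concrete, here is a minimal NumPy sketch of (i) a matrix-multiplication-free dilated causal convolution, assuming ternary weights so that every "multiply" becomes an add, a subtract, or a skip, and (ii) nearest-prototype few-shot classification on top of the resulting embeddings. All names, shapes, and the ternary-weight choice are illustrative assumptions, not Chameleon's actual microarchitecture.

```python
import numpy as np

def ternary_causal_conv(x, w_ternary, dilation=1):
    """x: (in_ch, T); w_ternary: (out_ch, in_ch, k) with entries in {-1, 0, +1}."""
    out_ch, in_ch, k = w_ternary.shape
    T = x.shape[1]
    pad = (k - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))            # left-pad so the convolution stays causal
    y = np.zeros((out_ch, T))
    for o in range(out_ch):
        for i in range(in_ch):
            for j in range(k):
                w = w_ternary[o, i, j]
                if w == 0:
                    continue                      # zero weight: no operation at all
                tap = xp[i, j * dilation : j * dilation + T]
                y[o] += tap if w > 0 else -tap    # add/subtract replaces the multiply
    return np.maximum(y, 0.0)                     # ReLU

def few_shot_prototypes(support_feats, support_labels):
    """Few-shot 'learning': average each class's embeddings into one prototype."""
    return {c: support_feats[support_labels == c].mean(axis=0)
            for c in np.unique(support_labels)}

def classify(query_feat, prototypes):
    """Inference: nearest prototype by Euclidean distance."""
    return min(prototypes, key=lambda c: np.linalg.norm(query_feat - prototypes[c]))

# Toy 5-way 1-shot episode on random 1-channel sequences.
rng = np.random.default_rng(0)
w = rng.choice([-1, 0, 1], size=(8, 1, 3))                       # ternary kernel
embed = lambda seq: ternary_causal_conv(seq, w, dilation=2).mean(axis=1)
support = np.stack([embed(rng.standard_normal((1, 64))) for _ in range(5)])
protos = few_shot_prototypes(support, np.arange(5))
print(classify(support[3], protos))                              # -> 3
```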
Related papers
- UnIT: Scalable Unstructured Inference-Time Pruning for MAC-efficient Neural Inference on MCUs [1.9626657740463982]
UnIT (Unstructured Inference-Time pruning) is a lightweight method that dynamically identifies and skips unnecessary multiply-accumulate (MAC) operations during inference. It transforms pruning decisions into lightweight comparisons, replacing multiplications with threshold checks and approximated divisions. UnIT achieves 11.02% to 82.03% MAC reduction, 27.30% to 84.19% faster inference, and 27.33% to 84.38% lower energy consumption compared to training-time pruned models.
arXiv Detail & Related papers (2025-07-10T16:12:06Z)
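A toy sketch of the skip-on-threshold idea described above: a multiply-accumulate is executed only if the incoming activation clears a precomputed per-weight bound, so the pruning decision costs a single comparison at run time. The threshold value and the bound formula are assumptions for illustration, not UnIT's exact criterion.

```python
import numpy as np

def pruned_dot(x, w, tau=0.05):
    """Dot product that skips MACs whose estimated contribution |x*w| is below tau."""
    # Per-weight activation bounds (tau / |w|) would be precomputed offline on an MCU.
    bound = np.where(np.abs(w) > 0, tau / np.maximum(np.abs(w), 1e-12), np.inf)
    acc, skipped = 0.0, 0
    for xi, wi, bi in zip(x, w, bound):
        if abs(xi) < bi:           # threshold check replaces the multiply
            skipped += 1
            continue
        acc += xi * wi             # only the "important" MACs are executed
    return acc, skipped

rng = np.random.default_rng(0)
x, w = rng.standard_normal(256), rng.standard_normal(256) * 0.1
approx, nskip = pruned_dot(x, w)
print(f"exact={x @ w:.3f}  approx={approx:.3f}  skipped {nskip}/256 MACs")
```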
- A Stable Whitening Optimizer for Efficient Neural Network Training [101.89246340672246]
Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method. First, we find that naive Shampoo is prone to divergence when matrix inverses are cached for long periods. Second, we adapt a shape-aware scaling to enable learning-rate transfer across network widths. Third, we find that high learning rates result in large parameter noise, and propose a simple iterate-averaging scheme that unblocks faster learning.
arXiv Detail & Related papers (2025-06-08T18:43:31Z)
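The iterate-averaging fix can be pictured with a short sketch: keep training with a large learning rate, but evaluate an exponential moving average of the parameters, which damps the parameter noise. The toy objective and the decay constant are placeholder choices, not SPlus's actual scheme.

```python
import numpy as np

def sgd_with_ema(grad_fn, theta, lr=0.5, decay=0.99, steps=1000):
    """Plain SGD plus an exponential moving average of the iterates for evaluation."""
    ema = theta.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)               # noisy, high-learning-rate update
        ema = decay * ema + (1 - decay) * theta    # averaged iterate used at eval time
    return theta, ema

# Toy noisy quadratic: the averaged iterate usually lands much closer to the optimum.
rng = np.random.default_rng(1)
grad = lambda t: t + 0.5 * rng.standard_normal(t.shape)   # true gradient + noise
raw, avg = sgd_with_ema(grad, rng.standard_normal(8))
print(f"|raw iterate| = {np.linalg.norm(raw):.3f}, |averaged iterate| = {np.linalg.norm(avg):.3f}")
```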
- EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
- Accelerating TinyML Inference on Microcontrollers through Approximate Kernels [3.566060656925169]
In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on microcontrollers.
Our evaluation on an STM32-Nucleo board and two popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our solutions achieve an average latency reduction of 21%.
arXiv Detail & Related papers (2024-09-25T11:10:33Z)
- FSL-HDnn: A 5.7 TOPS/W End-to-end Few-shot Learning Classifier Accelerator with Feature Extraction and Hyperdimensional Computing [8.836803844185619]
FSL-HDnn is an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction, classification, and on-chip few-shot learning.
It integrates two low-power modules: a weight-clustering feature extractor and a hyperdimensional computing classifier.
It achieves an unprecedented energy efficiency of 5.7 TOPS/W for the feature extraction phase and 0.78 TOPS/W for the classification and learning phases.
arXiv Detail & Related papers (2024-09-17T06:23:12Z)
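The hyperdimensional computing (HDC) classification and few-shot learning stage can be sketched as follows: extracted features are encoded into high-dimensional bipolar hypervectors, each class prototype is the bundled (summed and re-binarized) hypervector of its support examples, and inference is a similarity search. The dimensions and the random-projection encoder below are illustrative assumptions, not FSL-HDnn's exact encoding.

```python
import numpy as np

D, F = 4096, 64                                   # hypervector and feature dimensions
rng = np.random.default_rng(0)
proj = rng.choice([-1.0, 1.0], size=(F, D))       # fixed random bipolar projection

def encode(feat):
    """Map one feature vector to a bipolar hypervector."""
    return np.sign(feat @ proj)

def learn_prototypes(feats, labels):
    """Few-shot learning = bundling the hypervectors of each class's examples."""
    return {c: np.sign(sum(encode(f) for f in feats[labels == c]))
            for c in np.unique(labels)}

def classify(feat, protos):
    """Inference = similarity search against the class prototypes."""
    q = encode(feat)
    return max(protos, key=lambda c: q @ protos[c])

feats = rng.standard_normal((10, F))
labels = np.array([0] * 5 + [1] * 5)
protos = learn_prototypes(feats, labels)
print(classify(feats[0], protos))                 # classifies a support example
```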
- Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices [10.733902200950872]
We explore alternatives to fuse the depthwise and pointwise kernels that constitute the separable convolutional block.
Our approach aims to minimize time-consuming memory transfers by combining different data layouts.
arXiv Detail & Related papers (2024-06-18T10:32:40Z)
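For illustration, a minimal sketch of what fusing the two kernels buys: the depthwise result for each spatial position is consumed immediately by the pointwise 1x1 convolution, so the intermediate feature map is never written back to memory. The loop structure and data layout are assumptions, not the paper's optimized kernels.

```python
import numpy as np

def fused_dw_pw(x, dw, pw):
    """x: (C, H, W); dw: (C, k, k) depthwise kernels; pw: (C_out, C) pointwise weights."""
    C, H, W = x.shape
    k = dw.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    y = np.zeros((pw.shape[0], H, W))
    for i in range(H):
        for j in range(W):
            # Depthwise output for this single pixel only (a length-C vector)...
            dw_pixel = np.einsum("ckl,ckl->c", xp[:, i:i + k, j:j + k], dw)
            # ...is fed straight into the pointwise 1x1 convolution: no intermediate map.
            y[:, i, j] = pw @ dw_pixel
    return y

x = np.random.rand(8, 16, 16)
print(fused_dw_pw(x, np.random.rand(8, 3, 3), np.random.rand(4, 8)).shape)  # (4, 16, 16)
```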
- Class-Imbalanced Semi-Supervised Learning for Large-Scale Point Cloud Semantic Segmentation via Decoupling Optimization [64.36097398869774]
Semi-supervised learning (SSL) has been an active research topic for large-scale 3D scene understanding.
The existing SSL-based methods suffer from severe training bias due to class imbalance and long-tail distributions of the point cloud data.
We introduce a new decoupling optimization framework, which disentangles the learning of feature representations and the classifier in an alternating optimization manner to shift the biased decision boundary effectively.
arXiv Detail & Related papers (2024-01-13T04:16:40Z)
- Hadamard Domain Training with Integers for Class Incremental Quantized Learning [1.4416751609100908]
Continual learning can be cost-prohibitive for resource-constrained edge platforms.
We propose a technique that uses Hadamard-domain transforms to enable low-precision training with only integer matrix multiplications.
We achieve less than 0.5% and 3% accuracy degradation while quantizing all matrix-multiplication inputs down to 4 bits with 8-bit accumulators.
arXiv Detail & Related papers (2023-10-05T16:52:59Z)
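A rough sketch of the Hadamard-domain idea: both matrix-multiplication operands are rotated by an orthonormal Hadamard transform, which spreads outliers across all coordinates, then quantized to 4-bit integers; the product is computed with integer arithmetic and rescaled afterwards. The per-tensor scaling and bit widths are illustrative assumptions, not the paper's exact training scheme.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix for n a power of two (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit integers."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7).astype(np.int32), scale

def hadamard_int_matmul(A, B):
    """Approximate A @ B with 4-bit integer operands in the Hadamard domain."""
    H = hadamard(A.shape[1])
    Aq, sa = quantize_int4(A @ H)        # rotate, then quantize each operand
    Bq, sb = quantize_int4(H.T @ B)
    return (Aq @ Bq) * (sa * sb)         # integer matmul; H @ H.T = I undoes the rotation

A, B = np.random.randn(16, 64), np.random.randn(64, 32)
err = np.linalg.norm(hadamard_int_matmul(A, B) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error of the integer Hadamard-domain matmul: {err:.3f}")
```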
- Does Continual Learning Equally Forget All Parameters? [55.431048995662714]
Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of neural networks.
We study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL.
We propose a simpler and more efficient method that entirely removes the every-step replay and replaces it with FPF triggered only $k$ times periodically during CL.
arXiv Detail & Related papers (2023-04-09T04:36:24Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows speech to be better recognized in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks [10.195581493173643]
Human Activity Recognition (HAR) based on inertial data is an increasingly diffused task on embedded devices.
Most embedded HAR systems are based on simple and not-so-accurate classic machine learning algorithms.
This work proposes a set of efficient one-dimensional Convolutional Neural Networks (CNNs) deployable on general-purpose microcontrollers (MCUs).
arXiv Detail & Related papers (2022-09-02T06:32:11Z)
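For a sense of scale, here is a minimal PyTorch sketch of the kind of one-dimensional CNN such work targets: two 1D convolutions over a window of tri-axial inertial samples, global pooling, and a linear classifier. The layer sizes are assumptions chosen to fit typical MCU budgets, not the paper's actual models.

```python
import torch
import torch.nn as nn

class TinyHAR1DCNN(nn.Module):
    """A few thousand parameters: small enough to quantize and deploy on an MCU."""
    def __init__(self, in_ch=3, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # global average pooling
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                                # x: (batch, 3, window_len)
        return self.classifier(self.features(x).squeeze(-1))

model = TinyHAR1DCNN()
logits = model(torch.randn(1, 3, 128))                   # one 128-sample IMU window
print(logits.shape, sum(p.numel() for p in model.parameters()), "parameters")
```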
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
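The latent-replay mechanism of the last entry can be sketched in a few lines: activations from a frozen lower part of the network are quantized to 8 bits and cached in a small buffer, and continual updates train only the layers above the replay point on a mix of new and replayed latents. The feature extractor, head, buffer policy, and bit width below are placeholder choices, not the platform's configuration.

```python
import numpy as np
import torch
import torch.nn as nn

frontend = nn.Sequential(nn.Linear(64, 32), nn.ReLU())    # frozen below the replay layer
head = nn.Linear(32, 10)                                   # the only part trained continually
opt = torch.optim.SGD(head.parameters(), lr=0.01)
buffer = []                                                # entries: (int8 latent, scale, labels)

def quantize(z):
    scale = z.abs().max().item() / 127 + 1e-12
    return (z / scale).round().clamp(-128, 127).to(torch.int8), scale

def learn_step(x, y, replay_k=8):
    with torch.no_grad():
        z = frontend(x)                                    # latent activations, no gradient
    zq, s = quantize(z)
    buffer.append((zq, s, y))                              # cache the quantized latents
    if len(buffer) <= replay_k:
        replay = list(buffer)
    else:
        idx = np.random.choice(len(buffer), replay_k, replace=False)
        replay = [buffer[i] for i in idx]
    lat = torch.cat([zq_.float() * s_ for zq_, s_, _ in replay])   # dequantized replay batch
    lab = torch.cat([y_ for _, _, y_ in replay])
    loss = nn.functional.cross_entropy(head(lat), lab)
    opt.zero_grad(); loss.backward(); opt.step()           # update the head only
    return loss.item()

for _ in range(5):
    print(learn_step(torch.randn(4, 64), torch.randint(0, 10, (4,))))
```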