Clo-HDnn: A 4.66 TFLOPS/W and 3.78 TOPS/W Continual On-Device Learning Accelerator with Energy-efficient Hyperdimensional Computing via Progressive Search
- URL: http://arxiv.org/abs/2507.17953v1
- Date: Wed, 23 Jul 2025 21:50:28 GMT
- Title: Clo-HDnn: A 4.66 TFLOPS/W and 3.78 TOPS/W Continual On-Device Learning Accelerator with Energy-efficient Hyperdimensional Computing via Progressive Search
- Authors: Chang Eun Song, Weihong Xu, Keming Fan, Soumil Jain, Gopabandhu Hota, Haichao Yang, Leo Liu, Kerem Akarvardar, Meng-Fan Chang, Carlos H. Diaz, Gert Cauwenberghs, Tajana Rosing, Mingu Kang
- Abstract summary: Clo-HDnn is an on-device learning (ODL) accelerator designed for emerging continual learning (CL) tasks. Its dual-mode operation enables bypassing costly feature extraction for simpler datasets. Its progressive search reduces complexity by up to 61% by encoding and comparing only partial query hypervectors.
- Score: 7.700041585751539
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Clo-HDnn is an on-device learning (ODL) accelerator designed for emerging continual learning (CL) tasks. Clo-HDnn integrates hyperdimensional computing (HDC) along with low-cost Kronecker HD Encoder and weight clustering feature extraction (WCFE) to optimize accuracy and efficiency. Clo-HDnn adopts gradient-free CL to efficiently update and store the learned knowledge in the form of class hypervectors. Its dual-mode operation enables bypassing costly feature extraction for simpler datasets, while progressive search reduces complexity by up to 61% by encoding and comparing only partial query hypervectors. Achieving 4.66 TFLOPS/W (FE) and 3.78 TOPS/W (classifier), Clo-HDnn delivers 7.77x and 4.85x higher energy efficiency compared to SOTA ODL accelerators.
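The classifier's two key ideas, gradient-free learning with class hypervectors and progressive search over partial query hypervectors, can be illustrated compactly. Below is a minimal NumPy sketch, not the authors' implementation: the random-projection encoder stands in for the paper's Kronecker HD encoder, and the chunk size and early-exit margin are assumed values.

```python
import numpy as np

D = 4096       # hypervector dimensionality (assumed)
CHUNK = 512    # dimensions encoded/compared per progressive step (assumed)
MARGIN = 0.05  # normalized top-2 margin for early exit (assumed)

rng = np.random.default_rng(0)
F = 64                              # feature dimensionality (assumed)
proj = rng.standard_normal((F, D))  # random projection: stand-in for the
                                    # paper's Kronecker HD encoder

def encode_chunk(features, lo, hi):
    """Encode only dimensions [lo, hi) of the bipolar query hypervector."""
    return np.sign(features @ proj[:, lo:hi])

def learn(features_by_class):
    """Gradient-free CL: each class is the bundled (summed) hypervectors
    of its examples; adding a new class never touches the others."""
    return np.stack([np.sign(x @ proj).sum(axis=0) for x in features_by_class])

def progressive_classify(features, class_hvs):
    """Accumulate similarity chunk by chunk; stop once the leading class
    is clearly separated from the runner-up, skipping further encoding."""
    scores = np.zeros(class_hvs.shape[0])
    for lo in range(0, D, CHUNK):
        q = encode_chunk(features, lo, lo + CHUNK)
        scores += class_hvs[:, lo:lo + CHUNK] @ q
        top2 = np.sort(scores)[-2:]
        if (top2[1] - top2[0]) / (lo + CHUNK) > MARGIN:
            break  # confident early exit: this is the complexity saving
    return int(np.argmax(scores))

xs = [rng.standard_normal((5, F)) + c for c in range(3)]  # 3 toy classes
class_hvs = learn(xs)
print(progressive_classify(xs[1][0], class_hvs))
```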
Related papers
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly; a toy sketch of such a router follows this entry.
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
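A hedged illustration of that routing style, not BlockFFN's actual code: the layer sizes, names, and the exact composition order of ReLU and RMSNorm are assumptions. ReLU gates are non-negative and exactly zero for unrouted experts, which is what makes the sparsity acceleration-friendly while staying differentiable.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square over the last axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def relu_router(tokens, w_router):
    """ReLU routing (the ReLU/RMSNorm order is an assumption here):
    a zero gate means the expert is skipped, so sparsity is exact,
    yet the active gates remain differentiable for training."""
    gates = np.maximum(tokens @ w_router, 0.0)
    return rms_norm(gates)

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 128, 8, 4
tokens = rng.standard_normal((n_tokens, d_model))
w_router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)
gates = relu_router(tokens, w_router)
print("active experts per token:", (gates > 0).sum(axis=-1))
```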
- Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data [0.15178034047411867]
On-device learning at the edge enables low-latency, private personalization with improved long-term robustness and reduced maintenance costs. However, achieving scalable, low-power, end-to-end on-chip learning, especially from real-world sequential data with a limited number of examples, is an open challenge. We present Chameleon, leveraging three key contributions to solve these challenges.
arXiv Detail & Related papers (2025-05-30T17:49:30Z)
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method, which performs a full search using thousands of samples; a toy sketch of the propagation step follows this entry.
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
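A rough sketch of the propagation idea under heavy assumptions: the continuous relaxation of a dataflow code and the quadratic stand-in for the learned cost predictor are placeholders, not DCP's actual formulation. The point is that a differentiable predictor turns dataflow search into gradient descent on the code.

```python
import numpy as np

rng = np.random.default_rng(0)
CODE_DIM = 16  # continuous relaxation of a dataflow code (assumed)

# Stand-in for DCP's learned neural predictor: a fixed positive-definite
# quadratic mapping codes to a scalar cost (latency/energy surrogate).
A = rng.standard_normal((CODE_DIM, CODE_DIM))
H = A.T @ A + np.eye(CODE_DIM)
best = rng.standard_normal(CODE_DIM)  # unknown optimum of the surrogate

def predicted_cost(code):
    d = code - best
    return 0.5 * d @ H @ d

def cost_grad(code):
    return H @ (code - best)

# "Propagation": push the code along the predictor's gradient instead of
# enumerating candidate dataflows one by one.
code = rng.standard_normal(CODE_DIM)
for _ in range(200):
    code -= 0.01 * cost_grad(code)
print(f"final surrogate cost: {predicted_cost(code):.2e}")
```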
- FSL-HDnn: A 5.7 TOPS/W End-to-end Few-shot Learning Classifier Accelerator with Feature Extraction and Hyperdimensional Computing [8.836803844185619]
FSL-HDnn is an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction, classification, and on-chip few-shot learning.
It integrates two low-power modules: a weight clustering feature extractor and a hyperdimensional computing module.
It achieves an unprecedented energy efficiency of 5.7 TOPS/W for the feature extraction phase and 0.78 TOPS/W for the classification and learning phases; the few-shot bundling step is sketched after this entry.
arXiv Detail & Related papers (2024-09-17T06:23:12Z)
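In HDC pipelines of this kind, few-shot learning reduces to bundling the encoded support examples of each class: one pass, no gradients. A minimal sketch follows; the random-projection encoder and all sizes are assumptions standing in for the trained feature extractor.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D = 64, 2048                     # feature / hypervector dims (assumed)
proj = rng.standard_normal((F, D))  # stand-in encoder

def encode(x):
    return np.sign(x @ proj)        # bipolar hypervector

def few_shot_learn(support_x, support_y, n_classes):
    """Bundle (sum) each class's few support hypervectors into a
    class hypervector: a single pass, no gradient computation."""
    class_hvs = np.zeros((n_classes, D))
    for x, y in zip(support_x, support_y):
        class_hvs[y] += encode(x)
    return class_hvs

def classify(x, class_hvs):
    return int(np.argmax(class_hvs @ encode(x)))

# 5-way 1-shot toy run with random stand-in features
xs = rng.standard_normal((5, F))
class_hvs = few_shot_learn(xs, range(5), n_classes=5)
print(classify(xs[3], class_hvs))   # recovers class 3
```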
- 3D Learnable Supertoken Transformer for LiDAR Point Cloud Scene Segmentation [19.94836580257577]
This paper proposes a novel 3D Transformer framework, named the 3D Learnable Supertoken Transformer (3DLST). The 3DLST is equipped with a novel W-net architecture instead of the common U-net design. It achieves satisfactory algorithm efficiency, running up to 5x faster than previous best-performing methods.
arXiv Detail & Related papers (2024-05-23T20:41:15Z)
- HEAL: Brain-inspired Hyperdimensional Efficient Active Learning [13.648600396116539]
We introduce Hyperdimensional Efficient Active Learning (HEAL), a novel Active Learning framework tailored for HDC classification.
HEAL proactively annotates unlabeled data points via uncertainty and diversity-guided acquisition, leading to a more efficient dataset annotation and lowering labor costs.
Our evaluation shows that HEAL surpasses a diverse set of baselines in AL quality and achieves notably faster acquisition than many BNN-powered or diversity-guided AL methods; an uncertainty-plus-diversity acquisition rule is sketched after this entry.
arXiv Detail & Related papers (2024-02-17T08:41:37Z)
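The sketch below is a hedged approximation, not HEAL's exact formulation: the margin-based uncertainty, the z-score mixing, and the weight alpha are assumptions. It prefers pool points whose top-2 class scores are close (uncertain) and which are far from already-selected points (diverse).

```python
import numpy as np

def zscore(v):
    return (v - v.mean()) / (v.std() + 1e-9)

def acquire(scores, pool_hvs, selected_hvs, k, alpha=0.5):
    """Rank pool points by uncertainty (small top-2 score margin) plus
    diversity (low similarity to already-selected points); return the
    indices of the k points to annotate next."""
    top2 = np.sort(scores, axis=1)[:, -2:]
    uncertainty = -(top2[:, 1] - top2[:, 0])
    if len(selected_hvs):
        sims = pool_hvs @ np.asarray(selected_hvs).T
        diversity = -sims.max(axis=1)
    else:
        diversity = np.zeros(len(pool_hvs))
    rank = alpha * zscore(uncertainty) + (1 - alpha) * zscore(diversity)
    return np.argsort(rank)[-k:]

rng = np.random.default_rng(0)
pool_hvs = np.sign(rng.standard_normal((100, 512)))  # encoded pool (assumed)
class_hvs = np.sign(rng.standard_normal((4, 512)))
scores = pool_hvs @ class_hvs.T                      # HDC class scores
print("query for labels:", acquire(scores, pool_hvs, [], k=5))
```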
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach achieves 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, with little accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- Highly Parallel Autoregressive Entity Linking with Discriminative Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions.
Our model is more than 70 times faster and more accurate than the previous generative method.
arXiv Detail & Related papers (2021-09-08T17:28:26Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization in multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- SHEARer: Highly-Efficient Hyperdimensional Computing by Software-Hardware Enabled Multifold Approximation [7.528764144503429]
We propose SHEARer, an algorithm-hardware co-optimization that improves the performance and reduces the energy consumption of HD computing.
SHEARer achieves an average throughput boost of 104,904x (15.7x) and energy savings of up to 56,044x (301x) compared to state-of-the-art encoding methods.
We also develop a software framework that enables training HD models by emulating the proposed approximate encodings; a crude emulation of this idea is sketched after this entry.
arXiv Detail & Related papers (2020-07-20T07:58:44Z)
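As a hedged illustration only: the abstract does not detail SHEARer's circuit-level approximations, so the sketch below emulates an approximate encoder by flipping a random fraction of hypervector bits during training, so that inference on the approximate hardware sees the same statistics. The flip probability and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, FLIP_P = 64, 2048, 0.05        # flip prob. emulating hw error (assumed)
proj = rng.standard_normal((F, D))

def encode_approx(x):
    """Exact bipolar encoding followed by random sign flips: a crude
    software emulation of an approximate hardware encoder."""
    hv = np.sign(x @ proj)
    flips = rng.random(D) < FLIP_P
    hv[flips] *= -1.0
    return hv

# Train class hypervectors under the emulated approximation so the model
# tolerates the same errors at (approximate) inference time.
xs = rng.standard_normal((20, F))
ys = rng.integers(0, 2, 20)
class_hvs = np.zeros((2, D))
for x, y in zip(xs, ys):
    class_hvs[y] += encode_approx(x)
print(int(np.argmax(class_hvs @ encode_approx(xs[0]))), "vs label", ys[0])
```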
- SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all powers of two.
We further design a dedicated accelerator that fully utilizes the SmartExchange-enforced weights to improve both energy efficiency and latency; a toy illustration of the weight factorization follows this entry.
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
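The weight structure can be made concrete with a toy NumPy illustration in which the matrix sizes and sparsity level are assumptions. Because every non-zero coefficient is plus or minus a power of two, reconstructing W = C @ B on the accelerator needs only shifts and adds, trading cheap computation for expensive storage/access.

```python
import numpy as np

rng = np.random.default_rng(0)
OUT, IN, BASIS = 64, 64, 8           # layer and basis sizes (assumed)

# Small dense basis matrix B (cheap to store and fetch) ...
B = rng.standard_normal((BASIS, IN))

# ... and a large sparse coefficient matrix C whose non-zeros are +/- 2^k,
# so C @ B is computable with shifts and adds instead of multiplies.
C = np.zeros((OUT, BASIS))
mask = rng.random(C.shape) < 0.25    # ~25% non-zero (assumed sparsity)
exponents = rng.integers(-3, 3, C.shape)
signs = rng.choice([-1.0, 1.0], C.shape)
C[mask] = (signs * np.exp2(exponents))[mask]

W = C @ B                            # layer weights rebuilt on the fly

# Storage trade-off: dense W stores OUT*IN values; the factored form stores
# the dense basis plus only the sparse coefficients (each just sign+exponent).
print(f"dense: {OUT * IN} values, factored: {BASIS * IN + int(mask.sum())}")
```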
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.