BSC: Block-based Stochastic Computing to Enable Accurate and Efficient
TinyML
- URL: http://arxiv.org/abs/2111.06686v1
- Date: Fri, 12 Nov 2021 12:28:05 GMT
- Title: BSC: Block-based Stochastic Computing to Enable Accurate and Efficient
TinyML
- Authors: Yuhong Song, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Rui Xu, Yongzhuo
Zhang, Bingzhe Li, Lei Yang
- Abstract summary: Machine learning (ML) has been successfully applied to edge applications, such as smart phones and automated driving.
Today, more applications require ML on tiny devices with extremely limited resources, such as implantable cardioverter defibrillators (ICDs); this setting is known as TinyML.
Unlike ML on the edge, TinyML with a limited energy supply has higher demands on low-power execution.
- Score: 10.294484356351152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Along with the progress of AI democratization, machine learning (ML) has been
successfully applied to edge applications, such as smart phones and automated
driving. Nowadays, more applications require ML on tiny devices with extremely
limited resources, such as implantable cardioverter defibrillators (ICDs); this
setting is known as TinyML. Unlike ML on the edge, TinyML with a limited energy supply has
higher demands on low-power execution. Stochastic computing (SC) using
bitstreams for data representation is promising for TinyML since it can perform
the fundamental ML operations using simple logical gates, instead of the
complicated binary adder and multiplier. However, SC commonly suffers from low
accuracy for ML tasks due to low data precision and inaccuracy of arithmetic
units. Increasing the length of the bitstream, as in existing works, can
mitigate the precision issue but incurs higher latency. In this work, we propose
a novel SC architecture, namely Block-based Stochastic Computing (BSC). BSC
divides inputs into blocks, such that the latency can be reduced by exploiting
high data parallelism. Moreover, optimized arithmetic units and an output revision
(OUR) scheme are proposed to improve accuracy. On top of this, a global
optimization approach is devised to determine the number of blocks, which
yields a better latency-power trade-off. Experimental results show that BSC
outperforms existing designs, achieving over 10% higher accuracy on ML
tasks and over 6 times power reduction.
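The abstract's point that SC performs basic arithmetic with simple logic gates can be illustrated with a minimal Python sketch of textbook unipolar stochastic computing. This is not the paper's BSC architecture: the function names (`encode`, `decode`, `sc_multiply`, `sc_scaled_add`, `blockwise_decode`) are hypothetical, and the block-wise helper only assumes that a long bitstream is split into equal blocks whose ones are counted independently, which is one plausible reading of how parallel blocks could shorten latency.

```python
import random

def encode(p, length, rng):
    """Unipolar encoding: each bit is 1 with probability p, for 0 <= p <= 1."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode(bits):
    """Recover the encoded value as the fraction of ones in the stream."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent unipolar streams estimates a * b."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def sc_scaled_add(a_bits, b_bits, select_bits):
    """2-to-1 multiplexer: with a select stream of value 0.5,
    the output stream estimates (a + b) / 2."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, select_bits)]

def blockwise_decode(bits, num_blocks):
    """Illustrative assumption, not the paper's design: split the stream
    into equal blocks, count ones per block (each count could run in
    parallel hardware), then combine the partial counts."""
    size = len(bits) // num_blocks
    counts = [sum(bits[i * size:(i + 1) * size]) for i in range(num_blocks)]
    return sum(counts) / (size * num_blocks)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for length in (64, 1024, 16384):
        a = encode(0.5, length, rng)
        b = encode(0.6, length, rng)
        estimate = decode(sc_multiply(a, b))
        print(f"length={length:6d}  estimate={estimate:.4f}  exact=0.3000")
```

Running the script shows the usual trade-off described in the abstract: the estimate of 0.5 × 0.6 tends to get closer to 0.3 as the bitstream grows, at the cost of more cycles for a serial design. When the stream length is a multiple of the block count, `blockwise_decode` returns the same value as `decode`, while each block covers only `length / num_blocks` bits.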
Related papers
- SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding [28.76164449548306]
Multi-Draft Speculative Decoding (MDSD) offers a promising solution by using a smaller draft model to generate multiple token sequences.
We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead.
arXiv Detail & Related papers (2024-11-08T02:47:07Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- SignSGD with Federated Voting [69.06621279967865]
SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that can significantly reduce communication costs by one-bit quantization.
We propose a novel signSGD with federated voting (signSGD-FV).
The idea of federated voting is to exploit learnable weights to perform weighted majority voting.
We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes.
arXiv Detail & Related papers (2024-03-25T02:32:43Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Resource frugal optimizer for quantum machine learning [0.7046417074932257]
Quantum-enhanced data science, also known as quantum machine learning (QML), is of growing interest as an application of near-term quantum computers.
Variational QML algorithms have the potential to solve practical problems on real hardware, particularly when involving quantum data.
We advocate for simultaneous random sampling over both the datasets as well as the measurement operators that define the loss function.
arXiv Detail & Related papers (2022-11-09T15:29:03Z)
- QuaLA-MiniLM: a Quantized Length Adaptive MiniLM [5.36703735486629]
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding.
Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
arXiv Detail & Related papers (2022-10-31T07:42:52Z)
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
- Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning [8.98526174345299]
An algorithm for machine learning by quantum computer, quantum machine learning (QML), can exponentially speed up sampling of optimized random features.
We here construct a QML algorithm for a classification task accelerated by the optimized random features.
We prove that the QML algorithm for optimized random features, combined with stochastic gradient descent (SGD), can achieve state-of-the-art exponential convergence speed.
arXiv Detail & Related papers (2021-06-16T18:00:00Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.