BSC: Block-based Stochastic Computing to Enable Accurate and Efficient
TinyML
- URL: http://arxiv.org/abs/2111.06686v1
- Date: Fri, 12 Nov 2021 12:28:05 GMT
- Title: BSC: Block-based Stochastic Computing to Enable Accurate and Efficient
TinyML
- Authors: Yuhong Song, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Rui Xu, Yongzhuo
Zhang, Bingzhe Li, Lei Yang
- Abstract summary: Machine learning (ML) has been successfully applied to edge applications, such as smart phones and automated driving.
Today, more applications require ML on tiny devices with extremely limited resources, like implantable cardioverter defibrillator (ICD) which is known as TinyML.
Unlike ML on the edge, TinyML with a limited energy supply has higher demands on low-power execution.
- Score: 10.294484356351152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Along with the progress of AI democratization, machine learning (ML) has been
successfully applied to edge applications, such as smart phones and automated
driving. Nowadays, more applications require ML on tiny devices with extremely
limited resources, like implantable cardioverter defibrillator (ICD), which is
known as TinyML. Unlike ML on the edge, TinyML with a limited energy supply has
higher demands on low-power execution. Stochastic computing (SC) using
bitstreams for data representation is promising for TinyML since it can perform
the fundamental ML operations using simple logical gates, instead of the
complicated binary adder and multiplier. However, SC commonly suffers from low
accuracy for ML tasks due to low data precision and inaccuracy of arithmetic
units. Increasing the bitstream length, as in existing works, can mitigate the
precision issue but incurs higher latency. In this work, we propose
a novel SC architecture, namely Block-based Stochastic Computing (BSC). BSC
divides inputs into blocks, such that the latency can be reduced by exploiting
high data parallelism. Moreover, optimized arithmetic units and an output revision
(OUR) scheme are proposed to improve accuracy. On top of these, a global
optimization approach is devised to determine the number of blocks, enabling a
better latency-power trade-off. Experimental results show that BSC outperforms
existing designs, achieving over 10% higher accuracy on ML tasks and a more than
6x reduction in power.
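As a rough illustration of the mechanism described above (a minimal sketch, not the paper's implementation): in unipolar stochastic computing a value x in [0, 1] is encoded as the fraction of 1s in a random bitstream, multiplication then reduces to a bitwise AND of two independent streams, and, in the spirit of BSC, the stream can be split into blocks that independent hardware units could process in parallel. The stream length, block count, and function names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_bitstream(x, length):
    """Unipolar SC encoding: each bit is 1 with probability x, for x in [0, 1]."""
    return (rng.random(length) < x).astype(np.uint8)

def from_bitstream(bits):
    """Decode by counting 1s: the mean of the stream estimates the encoded value."""
    return bits.mean()

def sc_multiply(a_bits, b_bits):
    """Unipolar SC multiplication is a bitwise AND of two independent streams."""
    return a_bits & b_bits

def blockwise_multiply(a, b, length=1024, num_blocks=8):
    """Illustrative block-based variant: split the stream into independent blocks
    (which parallel hardware units could process) and average the per-block estimates."""
    block_len = length // num_blocks
    estimates = []
    for _ in range(num_blocks):
        a_bits = to_bitstream(a, block_len)
        b_bits = to_bitstream(b, block_len)
        estimates.append(from_bitstream(sc_multiply(a_bits, b_bits)))
    return float(np.mean(estimates))

if __name__ == "__main__":
    a, b = 0.75, 0.5
    print("exact product:", a * b)
    print("SC estimate  :", blockwise_multiply(a, b))
```

Longer streams lower the variance of the estimate, which is exactly the precision-versus-latency trade-off discussed above; the paper's optimized arithmetic units, OUR scheme, and block-count optimization are not modeled here.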
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the LLM's computational cost by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
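As a hedged, generic sketch of the early-exit idea behind this summary (plain NumPy, not DeeR's architecture or code; the block and head shapes are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def make_block(W):
    # stand-in for one transformer block
    return lambda h: np.tanh(W @ h)

def early_exit_forward(layers, exit_heads, x, threshold=0.9):
    """Generic dynamic early exit: run blocks one at a time and stop as soon as
    an intermediate head is confident enough, so easy inputs activate fewer layers."""
    probs = None
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        probs = softmax(head @ x)
        if probs.max() >= threshold:
            break
    return probs

# toy usage: six random 16-dim "blocks" with 4-way classification heads
rng = np.random.default_rng(0)
layers = [make_block(rng.normal(size=(16, 16))) for _ in range(6)]
exit_heads = [rng.normal(size=(4, 16)) for _ in range(6)]
print(early_exit_forward(layers, exit_heads, rng.normal(size=16)))
```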
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to its high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the compute and memory demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- SignSGD with Federated Voting [69.06621279967865]
SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that can significantly reduce communication costs by one-bit quantization.
We propose a novel signSGD with federated voting (signSGD-FV).
The idea of federated voting is to exploit learnable weights to perform weighted majority voting.
We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes.
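A minimal sketch of the voting idea summarized above, in plain NumPy: workers transmit only gradient signs and the server takes a weighted majority vote. With uniform weights this is ordinary signSGD-MV; signSGD-FV's learnable weights are replaced here by a caller-supplied vector, so this is not the authors' algorithm or code.

```python
import numpy as np

def sign_compress(grad):
    """Worker side: one-bit quantization, transmit only the sign of each coordinate."""
    return np.sign(grad).astype(np.int8)

def weighted_majority_vote(sign_grads, weights=None):
    """Server side: weighted majority vote over the workers' sign vectors.
    Uniform weights give plain signSGD-MV; signSGD-FV would learn the weights
    (here they are simply an input, as an assumption)."""
    sign_grads = np.stack(sign_grads)                 # (num_workers, dim)
    if weights is None:
        weights = np.ones(len(sign_grads))
    agg = np.tensordot(weights, sign_grads, axes=1)   # weighted sum of signs
    return np.sign(agg)

def sgd_step(params, vote, lr=0.01):
    """Update uses only the voted sign, so the uplink cost is one bit per coordinate."""
    return params - lr * vote

# toy usage: three workers with noisy gradients of f(w) = ||w||^2 / 2
w = np.array([1.0, -2.0, 0.5])
grads = [w + np.random.default_rng(i).normal(scale=0.1, size=w.shape) for i in range(3)]
w = sgd_step(w, weighted_majority_vote([sign_compress(g) for g in grads]))
print(w)
```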
arXiv Detail & Related papers (2024-03-25T02:32:43Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
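The lookup-table idea can be illustrated with a tiny, generic NumPy sketch (not DeepGEMM's SIMD kernels): with 2-bit operands there are only 4 x 4 = 16 possible products, so they can be precomputed once and gathered instead of multiplied.

```python
import numpy as np

BITS = 2
LEVELS = 1 << BITS                       # 4 representable values per operand

# Precompute all LEVELS x LEVELS products once; the inner loop then only
# performs table lookups and additions, no multiplications.
LUT = np.arange(LEVELS)[:, None] * np.arange(LEVELS)[None, :]

def lut_dot(w_codes, x_codes):
    """Dot product of two 2-bit code vectors via the precomputed table."""
    return int(LUT[w_codes, x_codes].sum())

# toy usage with codes in {0, 1, 2, 3}
w = np.array([3, 1, 0, 2])
x = np.array([2, 2, 1, 3])
assert lut_dot(w, x) == int((w * x).sum())
```

Real kernels additionally pack several codes per byte and realize the lookup with SIMD byte-shuffle instructions; this sketch only shows the table idea.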
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Tiny Classifier Circuits: Evolving Accelerators for Tabular Data [0.8936201690845327]
"Tiny" circuits are so tiny (i.e. consisting of no more than 300 logic gates) that they are called "Tiny" circuits.
This paper proposes a methodology for automatically predicting circuits for classification of data with comparable prediction to conventional machine learning.
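As an illustrative stand-in for that methodology (a naive random search over tiny gate-level circuits on binary features; the paper's actual evolutionary approach and gate budget are not reproduced here):

```python
import random

GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}

def random_circuit(n_inputs, n_gates=8):
    """A circuit is a list of (gate, src_a, src_b); sources index the input
    bits or earlier gate outputs. The last gate's output is the prediction."""
    circuit = []
    for g in range(n_gates):
        n_sources = n_inputs + g
        circuit.append((random.choice(list(GATES)),
                        random.randrange(n_sources),
                        random.randrange(n_sources)))
    return circuit

def evaluate(circuit, bits):
    wires = list(bits)
    for gate, a, b in circuit:
        wires.append(GATES[gate](wires[a], wires[b]))
    return wires[-1]

def search(X, y, n_gates=8, iters=2000):
    """Plain random search as a stand-in for the paper's evolutionary loop."""
    best, best_acc = None, -1.0
    for _ in range(iters):
        c = random_circuit(len(X[0]), n_gates)
        acc = sum(evaluate(c, x) == t for x, t in zip(X, y)) / len(y)
        if acc > best_acc:
            best, best_acc = c, acc
    return best, best_acc

# toy usage: learn XOR of the first two of three binary features
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]
circuit, acc = search(X, y)
print("training accuracy:", acc)
```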
arXiv Detail & Related papers (2023-02-28T19:13:39Z)
- Resource frugal optimizer for quantum machine learning [0.7046417074932257]
Quantum-enhanced data science, also known as quantum machine learning (QML), is of growing interest as an application of near-term quantum computers.
Variational QML algorithms have the potential to solve practical problems on real hardware, particularly when involving quantum data.
We advocate for simultaneous random sampling over both the datasets as well as the measurement operators that define the loss function.
arXiv Detail & Related papers (2022-11-09T15:29:03Z)
- QuaLA-MiniLM: a Quantized Length Adaptive MiniLM [5.36703735486629]
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding.
Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
arXiv Detail & Related papers (2022-10-31T07:42:52Z)
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
- Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning [8.98526174345299]
Quantum machine learning (QML), i.e. machine learning executed on a quantum computer, can exponentially speed up sampling of optimized random features.
We here construct a QML algorithm for a classification task accelerated by the optimized random features.
We prove that the QML algorithm for optimized random features, combined with stochastic gradient descent (SGD), can achieve state-of-the-art exponential convergence speed.
arXiv Detail & Related papers (2021-06-16T18:00:00Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)