Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices
- URL: http://arxiv.org/abs/2412.09289v1
- Date: Thu, 12 Dec 2024 13:59:21 GMT
- Title: Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices
- Authors: Thanaphon Suwannaphong, Ferdian Jovan, Ian Craddock, Ryan McConville
- Abstract summary: This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices.
The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size.
The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring.
- Score: 7.229732269884237
- Abstract: This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices, specifically for on-device indoor localisation. Typical approaches for indoor localisation rely on centralised remote processing of data transmitted from lower-powered devices such as wearables. However, there are several benefits to moving this to the edge device itself, including increased battery life, enhanced privacy, reduced latency and lowered operational costs, all of which are key for common applications such as health monitoring. The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size while maintaining high predictive performance. We base our work on a large state-of-the-art transformer-based model and seek to deploy it within low-power MCUs. We also propose a state-space-based architecture using Mamba as a more compact alternative to the transformer. Our results show that the quantized transformer model performs well within a 64 KB RAM constraint, achieving an effective balance between model size and localisation precision. Additionally, the compact Mamba model has strong performance under even tighter constraints, such as 32 KB of RAM, without the need for model compression, making it a viable option for more resource-limited environments. We demonstrate that, through our framework, it is feasible to deploy advanced indoor localisation models onto low-power MCUs with restricted memory limitations. The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring by providing accurate, real-time location data while minimizing power consumption, increasing data privacy, improving latency and reducing infrastructure costs.
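As a rough, non-authoritative illustration of the two compression techniques named above, the sketch below pairs temperature-scaled knowledge distillation with post-training dynamic quantization in PyTorch. The student architecture (a small RSSI-fingerprint classifier), temperature, and loss weighting are illustrative assumptions rather than the authors' configuration, and real MCU deployment would additionally need an int8 export path such as TensorFlow Lite Micro or CMSIS-NN.

```python
# Minimal sketch, assuming a hypothetical RSSI-based room classifier:
# temperature-scaled knowledge distillation + post-training dynamic quantization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend soft-target KL divergence (teacher -> student) with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical tiny student: 16 RSSI inputs -> 10 room labels (sizes are assumptions).
student = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))

# One illustrative training step against stand-in teacher logits.
teacher_logits = torch.randn(8, 10)          # would come from the large teacher model
inputs, labels = torch.randn(8, 16), torch.randint(0, 10, (8,))
loss = distillation_loss(student(inputs), teacher_logits, labels)
loss.backward()

# Post-training dynamic quantization: Linear weights stored as int8,
# shrinking the student's footprint roughly 4x versus FP32.
quantized_student = torch.ao.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
```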
Related papers
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose Sparse Gradient Compression (SGC) to address these limitations (see the top-k sparsification sketch after this list).
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications [0.1874930567916036]
This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model.
Results include a 52% reduction in memory usage and a 33% decrease in execution time, yielding better efficiency than state-of-the-art models.
arXiv Detail & Related papers (2024-12-25T14:41:23Z) - PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms [11.87161637895978]
We introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate large language models on mobile devices.
We provide a benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities.
arXiv Detail & Related papers (2024-10-05T03:37:07Z) - Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment.
arXiv Detail & Related papers (2024-08-20T09:42:17Z) - Interpreting and Improving Attention From the Perspective of Large Kernel Convolution [51.06461246235176]
We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution (see the sketch after this list).
LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
arXiv Detail & Related papers (2024-01-11T08:40:35Z) - MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory [76.02294791513552]
We propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory.
Experimental results demonstrate that our MCUFormer achieves 73.62% top-1 accuracy on ImageNet for image classification with 320KB memory.
arXiv Detail & Related papers (2023-10-25T18:00:26Z) - Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z) - TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64 MB of memory (see the latent-replay sketch after this list).
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Prior unstructured or structured weight pruning methods rarely translate into real inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration (see the magnitude-pruning sketch after this list).
arXiv Detail & Related papers (2021-06-15T17:22:59Z) - MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers [18.662026553041937]
Machine learning on resource-constrained microcontrollers (MCUs) promises to drastically expand the application space of the Internet of Things (IoT).
TinyML presents severe technical challenges, as deep neural network inference demands a large compute and memory budget.
Neural architecture search (NAS) promises to help design accurate ML models that meet the tight MCU memory, latency, and energy constraints.
arXiv Detail & Related papers (2020-10-21T19:39:39Z)
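For the Sparse Gradient Compression entry above, the following is a minimal sketch of generic top-k gradient sparsification, assuming only the largest-magnitude entries are kept; it illustrates the broad idea of reducing gradient memory, not the paper's specific SGC algorithm.

```python
import torch

def topk_sparsify(grad: torch.Tensor, k_ratio: float = 0.01):
    """Keep only the largest-magnitude k_ratio fraction of gradient entries.
    Generic top-k sparsification for illustration, not the paper's SGC method."""
    flat = grad.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[indices] = flat[indices]           # all other entries are dropped
    return sparse.view_as(grad), indices      # indices alone could be stored/sent

grad = torch.randn(1024, 1024)
sparse_grad, kept = topk_sparsify(grad)       # ~1% of entries survive
```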
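For the LKCA entry, the toy block below shows the general idea of replacing an attention operation with a single large depthwise convolution plus pointwise channel mixing; the kernel size and residual layout are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class LargeKernelConvBlock(nn.Module):
    """Toy attention stand-in: one large depthwise conv for spatial mixing,
    a 1x1 conv for channel mixing, and a residual connection."""
    def __init__(self, channels: int, kernel_size: int = 13):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        return x + self.pointwise(self.spatial(x))

x = torch.randn(1, 64, 32, 32)
assert LargeKernelConvBlock(64)(x).shape == x.shape
```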
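For the TinyML continual-learning entry, a toy reservoir buffer of intermediate-layer activations ("latents") sketches the latent replay idea; the buffer size is an assumption, and the paper's quantized, ultra-low-power HW/SW pipeline is not reproduced here.

```python
import random
import torch

class LatentReplayBuffer:
    """Toy reservoir-sampled store of (latent, label) pairs for replaying
    old data during continual learning. Illustration only."""
    def __init__(self, capacity: int = 512):
        self.capacity, self.seen, self.buffer = capacity, 0, []

    def add(self, latent: torch.Tensor, label: int):
        self.seen += 1
        item = (latent.detach().clone(), label)
        if len(self.buffer) < self.capacity:
            self.buffer.append(item)
        else:                                   # reservoir sampling keeps a
            j = random.randrange(self.seen)     # uniform sample of the stream
            if j < self.capacity:
                self.buffer[j] = item

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        latents, labels = zip(*batch)
        return torch.stack(latents), torch.tensor(labels)
```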
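For the micro-structured weight unification and pruning entry, the snippet below applies generic magnitude pruning via torch.nn.utils.prune as a baseline illustration; the hardware-compatible micro-structured unification the paper proposes is not captured by this simple call.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
# Zero out the 80% smallest-magnitude weights (unstructured; illustrative only).
prune.l1_unstructured(layer, name="weight", amount=0.8)
prune.remove(layer, "weight")   # bake the sparsity into the weight tensor
```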
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.