Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT
- URL: http://arxiv.org/abs/2504.16128v1
- Date: Mon, 21 Apr 2025 06:56:41 GMT
- Title: Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT
- Authors: Stanley Mugisha, Rashid Kisitu, Florence Tushabe
- Abstract summary: This work advances real-time, energy-efficient crop monitoring in precision agriculture. It demonstrates how ViT-level diagnostic precision can be attained on edge devices.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integrating deep learning applications into agricultural IoT systems faces the serious challenge of balancing the high accuracy of Vision Transformers (ViTs) with the efficiency demands of resource-constrained edge devices. Large transformer models such as the Swin Transformer excel at plant disease classification by capturing global-local dependencies, but their computational complexity (34.1 GFLOPs) renders them impractical for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML approaches are suitable for on-device inference but lack the spatial reasoning required for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that synergistically transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student model. Our method introduces adaptive attention alignment to resolve cross-architecture mismatch (resolution, channels) and a dual-loss function optimizing both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 attains 92.4% accuracy, compared with 95.9% for Swin-L, while reducing inference latency by 95% on PC and by approximately 82% on IoT devices (23 ms/image on a PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, with a 3.5% accuracy gain over MobileNetV3 baselines. Significantly, this work advances real-time, energy-efficient crop monitoring in precision agriculture and demonstrates how ViT-level diagnostic precision can be attained on edge devices. Code and models will be made available for replication after acceptance.
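The dual-loss formulation described above (soft-logit matching plus spatially aligned attention transfer) can be sketched as follows. This is a minimal, illustrative PyTorch sketch written from the abstract alone; the temperature T, the loss weights alpha and beta, the 1x1-convolution channel projection, and the bilinear resampling step are assumptions, not the authors' released code (which is promised only after acceptance).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridDistillationLoss(nn.Module):
    """Hypothetical combination of logit and attention distillation losses."""

    def __init__(self, student_channels, teacher_channels, T=4.0, alpha=0.5, beta=0.5):
        super().__init__()
        self.T = T          # softmax temperature for logit distillation (assumed value)
        self.alpha = alpha  # weight of the soft-logit (KL) term (assumed value)
        self.beta = beta    # weight of the attention-alignment term (assumed value)
        # 1x1 conv projects student attention maps to the teacher's channel count
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, s_logits, t_logits, s_attn, t_attn, labels):
        # Hard-label cross-entropy on the student's own predictions
        ce = F.cross_entropy(s_logits, labels)
        # Soft-logit distillation: temperature-scaled KL divergence to the teacher
        kd = F.kl_div(
            F.log_softmax(s_logits / self.T, dim=1),
            F.softmax(t_logits / self.T, dim=1),
            reduction="batchmean",
        ) * (self.T ** 2)
        # Adaptive attention alignment: match channels, then resample the student
        # map to the teacher's spatial resolution before comparing spatial focus
        s_attn = self.align(s_attn)
        s_attn = F.interpolate(s_attn, size=t_attn.shape[-2:], mode="bilinear",
                               align_corners=False)
        attn = F.mse_loss(s_attn, t_attn)
        return ce + self.alpha * kd + self.beta * attn
```

The 1x1 convolution and bilinear interpolation are one plausible reading of the paper's "adaptive attention alignment" for resolving the channel and resolution mismatch between a Swin Transformer teacher and a MobileNetV3 student; the actual alignment mechanism may differ.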
Related papers
- Semiconductor Wafer Map Defect Classification with Tiny Vision Transformers [0.0]
We propose ViT-Tiny, a lightweight Vision Transformer (ViT) framework optimized for wafer defect classification. ViT-Tiny outperforms its ViT-Base counterpart and state-of-the-art (SOTA) models such as MSF-Trans and CNN-based architectures. It achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, improving recall by 2.86% in two-defect classification, and increasing precision by 3.13% in three-defect classification.
arXiv Detail & Related papers (2025-04-03T11:18:00Z) - MobilePlantViT: A Mobile-friendly Hybrid ViT for Generalized Plant Disease Image Classification [2.0681376988193843]
Plant diseases significantly threaten global food security. Deep learning models have demonstrated impressive performance in plant disease identification, yet deploying these models on mobile and edge devices remains challenging due to high computational demands and resource constraints. We propose MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture designed for generalized plant disease classification.
arXiv Detail & Related papers (2025-03-20T18:34:02Z) - Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors [7.630782404476683]
We introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of the power delay profile (PDP).
We also propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy.
arXiv Detail & Related papers (2025-01-14T01:16:30Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Our model achieves 83.0%/84.1% top-1 accuracy with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - Fast Cell Library Characterization for Design Technology Co-Optimization Based on Graph Neural Networks [0.1752969190744922]
Design technology co-optimization (DTCO) plays a critical role in achieving optimal power, performance, and area.
We propose a graph neural network (GNN)-based machine learning model for rapid and accurate cell library characterization.
arXiv Detail & Related papers (2023-12-20T06:10:27Z) - Crop Disease Classification using Support Vector Machines with Green Chromatic Coordinate (GCC) and Attention based feature extraction for IoT based Smart Agricultural Applications [0.0]
Plant diseases can negatively affect leaves during agricultural cultivation, resulting in significant losses in crop output and economic value.
Various machine learning (ML) and deep learning (DL) algorithms have been developed and studied for plant disease detection.
This article presents a novel classification method that builds on prior work by utilising attention-based feature extraction, RGB channel-based chromatic analysis, and Support Vector Machines (SVM) for improved performance.
arXiv Detail & Related papers (2023-11-01T10:44:49Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - Adaptive Anomaly Detection for IoT Data in Hierarchical Edge Computing [71.86955275376604]
We propose an adaptive anomaly detection approach for hierarchical edge computing (HEC) systems to solve this problem.
We design an adaptive scheme to select one of the models based on the contextual information extracted from input data, to perform anomaly detection.
We evaluate our proposed approach using a real IoT dataset, and demonstrate that it reduces detection delay by 84% while maintaining almost the same accuracy as offloading detection tasks to the cloud.
arXiv Detail & Related papers (2020-01-10T05:29:17Z)