Related papers: LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons

URL: http://arxiv.org/abs/2511.00812v1
Date: Sun, 02 Nov 2025 05:51:48 GMT
Title: LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons
Authors: Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M. V. Lima, Neeraja J. Yadwadkar, Felipe M. G. Franca, Lizy K. John
Abstract summary: Vision Transformers have been tremendously successful in computer vision tasks. Large computational, memory, and energy demands are a challenge for edge inference on FPGAs. We propose LL-ViT, a novel edge optimized vision transformer design.
Score: 1.213604453116022
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands are a challenge for edge inference on FPGAs -- a field that has seen a recent surge in demand. We recognize the benefits of recent works on logic and Look Up Table (LUT) based networks, such as LogicNets, NeuraLUT, DWN, among others, in offering models that simultaneously reduce both the memory and compute footprints. However, these models natively do not perform well on common vision tasks, such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization that reveals that a majority of model weights and computations are from the channel mixer (MLP layer), we design an alternate LUT-based channel mixer, and simultaneously develop an FPGA-based accelerator for LL-ViT. Contrary to some attempts to replace each multiplication with a table lookup, our architecture utilizes a neural learning approach which natively learns the LUT functions. This approach allows for reduced model sizes, and a computational and energy-efficient inference solution for vision transformer models. Evaluating on edge-suitable workloads, we achieve accuracies of 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet datasets, comparable to the baseline transformer. LL-ViT eliminates over 60% of the model weights and 50% of the multiplications in the model, and achieves 1.9x energy efficiency and 1.3x lower latency over an integer quantized ViT accelerator, while also offering superior throughput against prior works at a 10.9W power budget.
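Illustration: the abstract describes replacing the MLP channel mixer with layers of LUT neurons whose functions are learned. The sketch below shows only what a table lookup neuron computes at inference time: a few input bits form an address, and the stored table entry is the output. It is not the authors' implementation. The names `lut_neuron`, `lut_layer`, the wiring tuples, and the hand-set XOR/AND tables are hypothetical, and the training procedure that learns the tables is not shown.

```python
from typing import List, Sequence


def lut_neuron(table: Sequence[int], bits: Sequence[int]) -> int:
    """Return the table entry addressed by the input bits (first bit is the MSB)."""
    if len(table) != 1 << len(bits):
        raise ValueError("table must have 2**len(bits) entries")
    address = 0
    for b in bits:
        address = (address << 1) | (b & 1)
    return table[address]


def lut_layer(
    tables: Sequence[Sequence[int]],
    wiring: Sequence[Sequence[int]],
    bits: Sequence[int],
) -> List[int]:
    """Each neuron reads the input bits selected by its wiring and does one lookup."""
    return [
        lut_neuron(table, [bits[i] for i in idx])
        for table, idx in zip(tables, wiring)
    ]


# Hypothetical example: two 2-input neurons over a 3-bit input.
# In a trained network the table contents would be learned, not hand-set.
XOR_TABLE = [0, 1, 1, 0]
AND_TABLE = [0, 0, 0, 1]
outputs = lut_layer([XOR_TABLE, AND_TABLE], [(0, 1), (1, 2)], [1, 0, 1])
print(outputs)
```

Note that the lookup involves no multiplications and that each neuron stores 2^n entries for n inputs, which is why such neurons are typically kept to a small number of inputs.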

Related papers

EdgeFlex-Transformer: Transformer Inference for Edge Devices [2.1130318406254074]
We propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs). Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline.
arXiv Detail & Related papers (2025-12-17T21:45:12Z)
Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference [0.30104001512119216]
Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications. We build on an approach for learning LUT networks directly via an Extended Finite Difference method. This allows for a computational and energy-efficient inference solution for transformer-based models.
arXiv Detail & Related papers (2024-11-04T05:38:56Z)
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
LPViT: Low-Power Semi-structured Pruning for Vision Transformers [43.126752035656196]
Vision transformers have emerged as a promising alternative to convolutional neural networks for image analysis tasks. One significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, complexity, and power consumption. We introduce a new block-structured pruning to address the resource-intensive issue for ViTs, offering a balanced trade-off between accuracy and hardware acceleration.
arXiv Detail & Related papers (2024-07-02T08:58:19Z)
TransAxx: Efficient Transformers with Approximate Computing [11.8440256799336]
Vision Transformer (ViT) models have been shown to be very competitive and have often become a popular alternative to Convolutional Neural Networks (CNNs). We propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic. Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations.
arXiv Detail & Related papers (2024-02-12T10:16:05Z)
A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE [0.8403582577557918]
The Transformer has been adopted for image recognition tasks and shown to outperform CNNs and RNNs, but it suffers from high training cost and computational complexity. We propose a lightweight hybrid model which uses Neural ODE as a backbone instead of ResNet. The proposed model is deployed on a modest-sized FPGA device for edge computing.
arXiv Detail & Related papers (2024-01-05T09:32:39Z)
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE). MoE achieves better accuracy and over 80% computation reduction but leaves challenges for efficient deployment on FPGA. Our work, dubbed Edge-MoE, solves these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
ViTA: A Vision Transformer Inference Accelerator for Edge Applications [4.3469216446051995]
Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks. They are compute-heavy and difficult to deploy in resource-constrained edge devices. We propose ViTA - a hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices.
arXiv Detail & Related papers (2023-02-17T19:35:36Z)
AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs. Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage. We derive novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.