Related papers: Accelerating Vision Transformers on Brain Processing Unit

Accelerating Vision Transformers on Brain Processing Unit

URL: http://arxiv.org/abs/2602.06300v1
Date: Fri, 06 Feb 2026 01:48:54 GMT
Title: Accelerating Vision Transformers on Brain Processing Unit
Authors: Jinchi Tang, Yan Guo,
Abstract summary: Vision Transformer (ViT) models have demonstrated superior performance and play increasingly crucial roles in computer vision tasks.<n>We propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators.<n>This is the first successful deployment of Vision Transformers that fully leverages BPU classification datasets demonstrate the effectiveness of our approach.
Score: 2.541819265668514
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) have emerged as dedicated platforms for CNN acceleration, offering optimized INT8 computation capabilities for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, due to the architectural mismatch between CNN-optimized hardware and Vision Transformer computation characteristics--namely, that linear layers in Transformers operate on three-dimensional data while BPU acceleration is designed for four-dimensional convolution operations-it is difficult or even impossible to leverage BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8* inference speedup. Our finetuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.

Related papers

ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages [0.0]
Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies.<n>We propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers.
arXiv Detail & Related papers (2025-04-21T03:00:17Z)
BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN)<n>We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations.<n>Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures [18.355570259898]
We propose the design of a 3D heterogeneous architecture referred to as Atleus.<n>Atleus incorporates heterogeneous computing resources specifically optimized to accelerate transformer models.<n>We show that Atleus outperforms existing state-of-the-art by up to 56x and 64.5x in terms of performance and energy efficiency respectively.
arXiv Detail & Related papers (2025-01-16T15:11:33Z)
TOAST: Transformer Optimization using Adaptive and Simple Transformations [40.311292704886235]
We introduce TOAST, a framework that exploits redundancies to approximate entire transformer blocks with lightweight closed-form mappings.<n>Results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.
arXiv Detail & Related papers (2024-10-07T11:35:24Z)
Efficient Neural Net Approaches in Metal Casting Defect Detection [0.0]
This research proposes a lightweight architecture that is efficient in terms of accuracy and inference time. Our results indicate that a custom model of 590K parameters with depth-wise separable convolutions outperformed pretrained architectures.
arXiv Detail & Related papers (2022-08-08T13:54:36Z)
Three things everyone should know about Vision Transformers [67.30250766591405]
transformer architectures have rapidly gained traction in computer vision. We offer three insights based on simple and easy to implement variants of vision transformers. We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use. Our method obtains more than 2x improvement on efficiency compared to state-of-the-art vision transformers with only 0.8% drop of accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs) We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the Vision-friendly Transformer' With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.