TRT-ViT: TensorRT-oriented Vision Transformer
- URL: http://arxiv.org/abs/2205.09579v2
- Date: Mon, 23 May 2022 15:19:57 GMT
- Title: TRT-ViT: TensorRT-oriented Vision Transformer
- Authors: Xin Xia, Jiashi Li, Jie Wu, Xing Wang, Xuefeng Xiao, Min Zheng, Rui Wang
- Abstract summary: A family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT.
Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers.
- Score: 19.173764508139016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We revisit existing state-of-the-art vision Transformers from the
perspective of practical application. Most of them are not even as efficient
as the basic ResNet series and deviate from realistic deployment scenarios.
This may be because the current criteria for measuring computational
efficiency, such as FLOPs or parameter counts, are one-sided, sub-optimal, and
hardware-insensitive. Thus, this
paper directly treats the TensorRT latency on the specific hardware as an
efficiency metric, which provides more comprehensive feedback involving
computational capacity, memory cost, and bandwidth. Based on a series of
controlled experiments, this work derives four practical guidelines for
TensorRT-oriented and deployment-friendly network design, e.g., early CNN and
late Transformer at stage-level, early Transformer and late CNN at block-level.
Accordingly, a family of TensorRT-oriented Transformers is presented,
abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT
significantly outperforms existing ConvNets and vision Transformers with
respect to the latency/accuracy trade-off across diverse visual tasks, e.g.,
image classification, object detection and semantic segmentation. For example,
at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7$\times$ faster than CSWin
and 2.0$\times$ faster than Twins. On the MS-COCO object detection task,
TRT-ViT achieves comparable performance with Twins, while the inference speed
is increased by 2.8$\times$.
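As a rough illustration of the two design guidelines quoted above ("early CNN and late Transformer at stage-level, early Transformer and late CNN at block-level"), the sketch below builds a toy hybrid backbone in PyTorch. The block internals, channel widths, and depths are hypothetical placeholders chosen for readability; they are not the TRT-ViT blocks from the paper, and the code only shows where attention sits relative to convolution at the stage and block level.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Residual convolution block; a stand-in for the paper's CNN-style block."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)


class MixedBlock(nn.Module):
    """Block-level guideline: attention (Transformer) first, convolution (CNN) last."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = ConvBlock(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed, need_weights=False)[0]
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.conv(x)                                  # convolution applied after attention


class HybridBackbone(nn.Module):
    """Stage-level guideline: pure-CNN early stages, attention only in the late stages."""

    def __init__(self, dims=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stage1 = ConvBlock(dims[0])                      # early stage: CNN only
        self.down1 = nn.Conv2d(dims[0], dims[1], kernel_size=2, stride=2)
        self.stage2 = ConvBlock(dims[1])                      # early stage: CNN only
        self.down2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)
        self.stage3 = MixedBlock(dims[2])                     # late stage: Transformer + CNN
        self.down3 = nn.Conv2d(dims[2], dims[3], kernel_size=2, stride=2)
        self.stage4 = MixedBlock(dims[3])                     # late stage: Transformer + CNN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage1(self.stem(x))
        x = self.stage2(self.down1(x))
        x = self.stage3(self.down2(x))
        return self.stage4(self.down3(x))


if __name__ == "__main__":
    out = HybridBackbone()(torch.randn(1, 3, 224, 224))
    print(out.shape)   # torch.Size([1, 512, 7, 7])
```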
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy (a rough illustrative sketch of this substitution idea appears after this list).
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - TerViT: An Efficient Ternary Vision Transformer [21.348788407233265]
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a task challenged by the large loss surface gap between real-valued and ternary parameters.
arXiv Detail & Related papers (2022-01-20T08:29:19Z) - FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive novel Hessian-based structural pruning criteria that are comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - ConvNets vs. Transformers: Whose Visual Representations are More
Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves the Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.