Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token
Migration
- URL: http://arxiv.org/abs/2211.12735v2
- Date: Fri, 5 Jan 2024 02:05:52 GMT
- Title: Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token
Migration
- Authors: Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi
Tian, Qixiang Ye
- Abstract summary: iTPN is born with two elaborated designs: 1) the first pre-trained feature pyramid upon vision transformer (ViT); 2) multi-stage supervision to the feature pyramid using masked feature modeling (MFM).
Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss.
- Score: 138.24994198567794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the integrally pre-trained transformer pyramid network (iTPN), which jointly optimizes the network backbone and the neck, so that the transfer gap between representation models and downstream tasks is minimized. iTPN is born with two elaborated designs: 1) the first pre-trained feature pyramid upon vision transformer (ViT); 2) multi-stage supervision to the feature pyramid using masked feature modeling (MFM). iTPN is updated to Fast-iTPN, which reduces computational memory overhead and accelerates inference through two flexible designs: 1) token migration, which drops redundant tokens in the backbone while replenishing them in the feature pyramid without attention operations; 2) token gathering, which reduces the cost of global attention by introducing a few gathering tokens (both designs are sketched in code below). The base/large-level Fast-iTPN models achieve 88.75%/89.5% top-1 accuracy on ImageNet-1K. With a 1x training schedule using DINO, base/large-level Fast-iTPN achieves 58.4%/58.8% box AP on COCO object detection, and 57.5%/58.7% mIoU on ADE20K semantic segmentation using MaskDINO. Fast-iTPN accelerates inference by up to 70% with negligible performance loss, demonstrating its potential as a powerful backbone for downstream vision tasks. The code is available at: github.com/sunsmarterjie/iTPN.
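The two Fast-iTPN designs lend themselves to short sketches. First, token migration as the abstract describes it: score backbone tokens, drop the redundant ones, and later scatter the processed survivors back into the full sequence for the feature pyramid using pure indexing, with no attention operations. The function names, `keep_ratio`, and the source of the scores are illustrative assumptions, not the authors' implementation.

```python
import torch

def migrate_tokens(x, scores, keep_ratio=0.5):
    """Drop low-scoring backbone tokens; return the kept tokens plus the
    bookkeeping needed to replenish the dropped ones later.

    x:      (B, N, C) token embeddings
    scores: (B, N) per-token importance (e.g., attention to a class token)
    """
    B, N, C = x.shape
    n_keep = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(n_keep, dim=1).indices                 # (B, n_keep)
    kept = x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    return kept, keep_idx

def replenish_tokens(kept, keep_idx, full_tokens):
    """Scatter the processed tokens back into the full sequence; dropped
    positions keep their pre-drop values (pure indexing, no attention)."""
    B, N, C = full_tokens.shape
    out = full_tokens.clone()
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C), kept)
    return out

# usage sketch
x = torch.randn(2, 196, 768)
scores = torch.rand(2, 196)
kept, idx = migrate_tokens(x, scores, keep_ratio=0.5)
# ... run transformer blocks on `kept` only (half the tokens) ...
full = replenish_tokens(kept, idx, x)     # (2, 196, 768) for the pyramid
```

The point of the sketch is that replenishment is a gather/scatter, which is why it can add negligible cost.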
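Second, a sketch of token gathering under the same caveat: a few extra tokens first summarize the whole sequence, and patch tokens then attend only to those G summaries, replacing O(N^2) global attention with O(N*G) for G << N. The two-step cross-attention structure is an assumed reading of the one-line description above, not the paper's exact block.

```python
import torch
import torch.nn as nn

class GatherAttention(nn.Module):
    """Hypothetical gathering-token block: G learnable tokens summarize the
    sequence, then patch tokens attend only to those G summaries."""
    def __init__(self, dim, num_gather=8, heads=8):
        super().__init__()
        self.gather = nn.Parameter(torch.randn(1, num_gather, dim) * 0.02)
        self.summarize = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, C)
        g = self.gather.expand(x.size(0), -1, -1)
        g, _ = self.summarize(g, x, x)         # gathering tokens read all patches
        out, _ = self.broadcast(x, g, g)       # patches read only the G summaries
        return x + out

blk = GatherAttention(dim=768, num_gather=8)
y = blk(torch.randn(2, 196, 768))              # (2, 196, 768)
```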
Related papers
- GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim to reduce the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to balance model efficiency against information preservation in efficient ViTs; a sketch of the idea follows this entry.
arXiv Detail & Related papers (2023-11-06T11:14:19Z)
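The blurb above names the core tension: pruning saves compute but discards information. Below is a hedged sketch of graph-based propagation, where each pruned token hands its feature to its nearest kept neighbor before removal; the 1-nearest-neighbor cosine graph is an assumption for illustration, not GTP-ViT's exact propagation rule.

```python
import torch
import torch.nn.functional as F

def propagate_and_prune(x, scores, keep_ratio=0.5):
    """Prune tokens, but first propagate their information to the kept
    tokens over a 1-nearest-neighbor similarity graph."""
    B, N, C = x.shape
    n_keep = int(N * keep_ratio)
    order = scores.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :n_keep], order[:, n_keep:]
    kept = x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    dropped = x.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, C))

    # route every dropped token to its most similar kept token
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    nearest = sim.argmax(dim=2)                                   # (B, N - n_keep)

    # each kept token is averaged with the dropped tokens routed to it
    absorbed = torch.zeros_like(kept)
    absorbed.scatter_add_(1, nearest.unsqueeze(-1).expand(-1, -1, C), dropped)
    counts = torch.ones(B, n_keep, 1)
    counts.scatter_add_(1, nearest.unsqueeze(-1), torch.ones(B, N - n_keep, 1))
    return (kept + absorbed) / counts

out = propagate_and_prune(torch.randn(2, 196, 384), torch.rand(2, 196))
# out: (2, 98, 384) -- half the tokens, none discarded outright
```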
- PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676]
We propose a novel acceleration framework, token Pruning & Pooling Transformers (PPT).
PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters; a rough sketch of the combination follows this entry.
It reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S, without any accuracy drop on ImageNet.
arXiv Detail & Related papers (2023-10-03T05:55:11Z)
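A rough, parameter-free sketch of combining pruning with pooling as the summary describes: attentive tokens are kept as-is, and the inattentive remainder is average-pooled into a few summary tokens rather than discarded. PPT's actual per-layer pruning/pooling policy differs; `keep_ratio` and `pooled_tokens` are illustrative names.

```python
import torch

def prune_and_pool(x, scores, keep_ratio=0.6, pooled_tokens=4):
    """Keep the attentive tokens and pool the inattentive rest into a few
    summary tokens, with no trainable parameters involved."""
    B, N, C = x.shape
    n_keep = int(N * keep_ratio)
    order = scores.argsort(dim=1, descending=True)
    kept = x.gather(1, order[:, :n_keep].unsqueeze(-1).expand(-1, -1, C))
    rest = x.gather(1, order[:, n_keep:].unsqueeze(-1).expand(-1, -1, C))
    # pool the inattentive tokens into `pooled_tokens` groups
    chunks = rest.chunk(pooled_tokens, dim=1)
    pooled = torch.stack([c.mean(dim=1) for c in chunks], dim=1)
    return torch.cat([kept, pooled], dim=1)   # (B, n_keep + pooled_tokens, C)

x, scores = torch.randn(2, 196, 384), torch.rand(2, 196)
out = prune_and_pool(x, scores)               # (2, 121, 384)
```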
- Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers [34.19166698049552]
Vision Transformers (ViTs) have shown competitive performance compared to convolutional neural networks (CNNs).
We propose a novel approach to learning instance-dependent attention patterns by devising a lightweight connectivity predictor module; a sketch of the idea follows this entry.
We show that our method reduces the FLOPs of MHSA by 48% to 69% while keeping the accuracy drop within 0.4%.
arXiv Detail & Related papers (2023-03-24T02:12:28Z)
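A hedged, single-head sketch of instance-dependent sparse attention: a cheap low-rank "connectivity predictor" scores token pairs, and full attention is masked to the predicted top-k links. Note the sketch still materializes the dense score matrix; a real implementation would evaluate attention sparsely to realize the FLOPs savings. All module names are assumptions.

```python
import torch
import torch.nn as nn

class SparseAttention(nn.Module):
    """Predict per-instance connectivity cheaply, then attend only on the
    predicted links (single head for brevity)."""
    def __init__(self, dim, low_dim=32, topk=32):
        super().__init__()
        self.pq = nn.Linear(dim, low_dim)   # cheap low-rank projections
        self.pk = nn.Linear(dim, low_dim)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.topk = topk

    def forward(self, x):                   # x: (B, N, C)
        B, N, C = x.shape
        # predict connectivity with the cheap projections
        conn = self.pq(x) @ self.pk(x).transpose(1, 2)        # (B, N, N)
        idx = conn.topk(self.topk, dim=-1).indices
        mask = torch.full_like(conn, float('-inf')).scatter(-1, idx, 0.0)
        # full-dimension attention, restricted to the predicted links
        attn = (self.q(x) @ self.k(x).transpose(1, 2)) / C ** 0.5 + mask
        return attn.softmax(dim=-1) @ self.v(x)

y = SparseAttention(dim=384)(torch.randn(2, 196, 384))        # (2, 196, 384)
```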
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers; the semantic-token idea is sketched after this entry.
In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it work for downstream tasks.
Our method achieves results competitive with the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
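A hedged sketch of the semantic-token idea: a handful of cluster-center tokens, initialized by pooling the patch grid and refined by one cross-attention over all patches, stand in for the hundreds of original tokens. The pooling initialization and single refinement step are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticTokens(nn.Module):
    """Replace N patch tokens with a few semantic (cluster-center) tokens."""
    def __init__(self, dim, num_semantic=16, heads=6):
        super().__init__()
        self.side = int(num_semantic ** 0.5)   # pool the grid to side x side
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, grid=14):             # x: (B, N, C), N = grid * grid
        B, N, C = x.shape
        # initialize centers by average-pooling the token grid
        feat = x.transpose(1, 2).reshape(B, C, grid, grid)
        centers = F.adaptive_avg_pool2d(feat, self.side).flatten(2).transpose(1, 2)
        # refine centers by attending to all patch tokens
        refined, _ = self.attn(centers, x, x)
        return refined                          # (B, num_semantic, C)

sem = SemanticTokens(dim=384)(torch.randn(2, 196, 384))   # (2, 16, 384)
```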
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images; a sketch of such a HOG target follows this entry.
It achieves 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
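A hedged sketch of a HOG-style regression target: per-patch orientation histograms computed from Sobel gradients, which a masked-image-modeling loss could regress instead of RGB pixels. The Sobel kernels, 9 bins, and absence of block normalization are simplifications, not FastMIM's exact HOG pipeline.

```python
import math
import torch
import torch.nn.functional as F

def hog_target(img, patch=16, bins=9):
    """Per-patch histogram of gradient orientations (simplified, unnormalized)."""
    # grayscale + Sobel gradients
    gray = img.mean(dim=1, keepdim=True)                       # (B, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    gx = F.conv2d(gray, kx.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(gray, kx.t().reshape(1, 1, 3, 3), padding=1)
    mag = (gx ** 2 + gy ** 2).sqrt()
    ang = torch.atan2(gy, gx) % math.pi                        # unsigned, [0, pi)

    # accumulate a magnitude-weighted orientation histogram per patch
    B, _, H, W = gray.shape
    bin_idx = (ang / math.pi * bins).long().clamp(max=bins - 1)
    onehot = F.one_hot(bin_idx.squeeze(1), bins).float()       # (B, H, W, bins)
    weighted = onehot * mag.squeeze(1).unsqueeze(-1)
    hist = weighted.view(B, H // patch, patch, W // patch, patch, bins)
    return hist.sum(dim=(2, 4)).flatten(1, 2)                  # (B, Np, bins)

t = hog_target(torch.randn(2, 3, 224, 224))                    # (2, 196, 9)
```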
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have large numbers of parameters and heavy computation costs.
FastFlowNet works in the well-known coarse-to-fine manner with several innovations; the generic coarse-to-fine scheme is sketched after this entry.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
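A generic coarse-to-fine skeleton of the kind the entry names (not FastFlowNet's specific modules): the flow estimated at a coarse pyramid level is upsampled, used to warp the second image's features, and refined by a residual estimator at the next level. `estimator` is a placeholder callable.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp `feat` by `flow` (channels 0/1 = x/y displacement)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack((xs, ys), dim=-1).float() + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(feat, grid, align_corners=True)

def coarse_to_fine_flow(feats1, feats2, estimator):
    """Pyramids are ordered coarsest -> finest; each level adds a residual."""
    flow = None
    for f1, f2 in zip(feats1, feats2):
        if flow is None:
            flow = torch.zeros(f1.size(0), 2, f1.size(2), f1.size(3))
        else:  # upsample the running flow and rescale its displacements
            flow = 2.0 * F.interpolate(flow, scale_factor=2,
                                       mode='bilinear', align_corners=False)
        flow = flow + estimator(f1, warp(f2, flow), flow)
    return flow

# usage with a dummy estimator that predicts a zero residual
est = lambda f1, f2w, fl: torch.zeros_like(fl)
pyr1 = [torch.randn(1, 16, 7 * 2 ** i, 7 * 2 ** i) for i in range(3)]
pyr2 = [torch.randn(1, 16, 7 * 2 ** i, 7 * 2 ** i) for i in range(3)]
flow = coarse_to_fine_flow(pyr1, pyr2, est)     # (1, 2, 28, 28)
```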
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks; one tokens-to-token step is sketched after this entry.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by a factor of two, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of size comparable to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
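A minimal sketch of one tokens-to-token step: an overlapping "soft split" (here via `F.unfold`) merges each 3x3 token neighborhood into a single token, shrinking the token count roughly 4x per step. The kernel size, stride, and linear projection are illustrative choices, not T2T-ViT's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class T2TStep(nn.Module):
    """One tokens-to-token step: grid -> overlapping 3x3 merge -> project."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim * 9, dim)

    def forward(self, x, grid):                # x: (B, grid*grid, C)
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, grid, grid)
        # overlapping 3x3 patches, stride 2 -> roughly half the side length
        patches = F.unfold(feat, kernel_size=3, stride=2, padding=1)
        tokens = patches.transpose(1, 2)       # (B, N', 9*C)
        return self.proj(tokens)               # (B, N', C)

step = T2TStep(dim=64)
out = step(torch.randn(2, 196, 64), grid=14)   # (2, 49, 64)
```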
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention for being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, the computational complexity of the ternary inner product can be reduced by a factor of 2; a toy ternary inner product is sketched after this entry.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
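The factor-2 claim concerns bit-level encodings of ternary values. The toy below shows, hedged, how a ternary vector decomposes into two binary masks so the inner product reduces to AND/popcount-style terms; the threshold ternarizer and this decomposition are generic illustrations, not the paper's implementation-dependent algorithm.

```python
import numpy as np

def ternarize(w, thresh_ratio=0.05):
    """Simple threshold ternarization: weights become -1, 0, or +1."""
    t = thresh_ratio * np.abs(w).max()
    return np.sign(w) * (np.abs(w) > t)

def ternary_dot(a, b):
    """Ternary inner product from two binary masks per operand: four
    AND/popcount terms replace floating-point multiplication entirely."""
    ap, an = a > 0, a < 0
    bp, bn = b > 0, b < 0
    return int((ap & bp).sum() + (an & bn).sum()
               - (ap & bn).sum() - (an & bp).sum())

w = ternarize(np.random.randn(256))
x = ternarize(np.random.randn(256))
assert ternary_dot(w, x) == int(w @ x)
```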
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the accuracy decline is due to activation quantization; a minimal activation-quantization sketch follows this entry.
Our integer networks achieve performance equivalent to the corresponding full-precision networks (FPNs), but with only 1/4 of the memory cost, and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
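A minimal sketch of the bounded-activation idea the summary describes: clamping activations to a fixed range makes a uniform integer quantizer well-behaved. The `bound=6.0` default and 8-bit setting are assumptions for illustration, not the paper's exact pipeline.

```python
import torch

def bounded_relu(x, bound=6.0):
    """Clamp activations to [0, bound]; the fixed upper bound is what makes
    a uniform activation quantizer well-behaved."""
    return x.clamp(0.0, bound)

def quantize_activation(x, bound=6.0, bits=8):
    """Uniformly quantize a bounded activation to `bits`-bit integers and
    return the scale needed to interpret them."""
    levels = 2 ** bits - 1
    scale = bound / levels
    q = (bounded_relu(x, bound) / scale).round().to(torch.uint8)
    return q, scale

x = torch.randn(4, 16) * 3
q, s = quantize_activation(x)
dequant = q.float() * s            # max quantization error is s / 2
```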