Bilaterally Slimmable Transformer for Elastic and Efficient Visual
Question Answering
- URL: http://arxiv.org/abs/2203.12814v2
- Date: Fri, 12 May 2023 15:15:16 GMT
- Title: Bilaterally Slimmable Transformer for Elastic and Efficient Visual
Question Answering
- Authors: Zhou Yu, Zitian Jin, Jun Yu, Mingliang Xu, Hongbo Wang, Jianping Fan
- Abstract summary: The bilaterally slimmable Transformer (BST) can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
- Score: 75.86788916930377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Transformer architectures [1] have brought remarkable
improvements to visual question answering (VQA). Nevertheless,
Transformer-based VQA models are usually deep and wide to guarantee good
performance, so they can only run on powerful GPU servers and cannot run on
capacity-restricted platforms such as mobile phones. Therefore, it is desirable
to learn an elastic VQA model that supports adaptive pruning at runtime to meet
the efficiency constraints of different platforms. To this end, we present the
bilaterally slimmable Transformer (BST), a general framework that can be
seamlessly integrated into arbitrary Transformer-based VQA models to train a
single model once and obtain various slimmed submodels of different widths and
depths. To verify the effectiveness and generality of this method, we integrate
the proposed BST framework with three typical Transformer-based VQA approaches,
namely MCAN [2], UNITER [3], and CLIP-ViL [4], and conduct extensive
experiments on two commonly-used benchmark datasets. In particular, one slimmed
MCAN-BST submodel achieves comparable accuracy on VQA-v2, while being 0.38x
smaller in model size and having 0.27x fewer FLOPs than the reference MCAN
model. The smallest MCAN-BST submodel only has 9M parameters and 0.16G FLOPs
during inference, making it possible to deploy it on a mobile device with less
than 60 ms latency.
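To make the "train once, slim at runtime" idea concrete, below is a minimal PyTorch-style sketch of a Transformer stack whose depth and feed-forward width can be sliced at inference time. It is only an illustration under assumptions made here (the class names, the keep-the-first-channels slicing rule, and the omission of attention are invented for brevity); it is not the paper's BST implementation, which additionally trains the shared weights so that every slimmed submodel stays accurate.

```python
# Hypothetical sketch of width/depth slimming; NOT the authors' BST code.
import torch
import torch.nn as nn


class SlimmableLinear(nn.Linear):
    """Linear layer that can run on a top-left slice of its weight matrix."""

    def forward(self, x, in_features=None, out_features=None):
        in_f = in_features or self.in_features
        out_f = out_features or self.out_features
        weight = self.weight[:out_f, :in_f]
        bias = self.bias[:out_f] if self.bias is not None else None
        return nn.functional.linear(x, weight, bias)


class SlimmableFFN(nn.Module):
    """Feed-forward block whose hidden width can be reduced at runtime."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = SlimmableLinear(d_model, d_ff)
        self.fc2 = SlimmableLinear(d_ff, d_model)

    def forward(self, x, width_ratio=1.0):
        d_ff = max(1, int(self.fc1.out_features * width_ratio))
        h = torch.relu(self.fc1(x, out_features=d_ff))
        return self.fc2(h, in_features=d_ff)


class SlimmableEncoder(nn.Module):
    """Stack of blocks; depth and width are chosen per forward pass."""

    def __init__(self, num_layers=12, d_model=512):
        super().__init__()
        self.layers = nn.ModuleList(SlimmableFFN(d_model) for _ in range(num_layers))

    def forward(self, x, depth=None, width_ratio=1.0):
        for layer in self.layers[: depth or len(self.layers)]:
            x = x + layer(x, width_ratio)  # residual connection
        return x


if __name__ == "__main__":
    encoder = SlimmableEncoder()
    tokens = torch.randn(2, 20, 512)                   # (batch, seq, d_model)
    full = encoder(tokens)                             # full-width, full-depth
    slim = encoder(tokens, depth=6, width_ratio=0.25)  # slimmed submodel
    print(full.shape, slim.shape)
```

A single set of shared weights serves every (depth, width) choice, which is what lets one trained model cover platforms with very different compute budgets.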
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that can distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting [98.12558945781693]
We propose a transformer-based model UniTST containing a unified attention mechanism on the flattened patch tokens.
Although our proposed model employs a simple architecture, it offers compelling performance as shown in our experiments on several datasets for time series forecasting.
arXiv Detail & Related papers (2024-06-07T14:39:28Z)
- Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture [31.763186154430347]
We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension.
As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style classification, and causal GPT-style language modeling.
For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in GLUE quality with up to 27% fewer parameters, and up to 9.1x higher throughput at sequence length 4K.
arXiv Detail & Related papers (2023-10-18T17:06:22Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models, which contain millions of parameters, are first pre-trained on a large corpus and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z)
- Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z)
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space [35.04846842178276]
We introduce a pure vision transformer slimming (ViT-Slim) framework that can search such a sub-structure across multiple dimensions.
Our method is based on a learnable and unified l1 sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions.
Our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by 0.6% on ImageNet.
arXiv Detail & Related papers (2022-01-03T18:59:54Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
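Several of the entries above (ViT-Slim, SPViT, NViT) slim or prune Transformer dimensions by scoring their importance. The snippet below is a rough, generic sketch of one such mechanism, a learnable mask over FFN hidden channels trained with an l1 penalty; the module layout and penalty weight are assumptions made here for illustration and do not reproduce any of the listed papers' implementations.

```python
# Generic sketch of an l1-penalized channel mask; NOT ViT-Slim/SPViT/NViT code.
import torch
import torch.nn as nn


class MaskedFFN(nn.Module):
    """FFN with one learnable gate per hidden channel; small gates mark
    channels that can be pruned once the sparsity search has converged."""

    def __init__(self, d_model=384, d_hidden=1536):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.mask = nn.Parameter(torch.ones(d_hidden))

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)) * self.mask)

    def l1_penalty(self):
        return self.mask.abs().sum()


if __name__ == "__main__":
    ffn = MaskedFFN()
    x = torch.randn(8, 197, 384)                  # (batch, tokens, d_model)
    task_loss = ffn(x).pow(2).mean()              # placeholder task loss
    loss = task_loss + 1e-4 * ffn.l1_penalty()    # l1-regularized objective
    loss.backward()
    kept = (ffn.mask.abs() > 1e-2).sum().item()   # channels that would survive
    print(f"channels kept after thresholding: {kept}/{ffn.mask.numel()}")
```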