Paper overview: Castling-ViT: Compressing Self-Attention via Switching Towards
Linear-Angular Attention During Vision Transformer Inference
- arxiv url: http://arxiv.org/abs/2211.10526v2
- Date: Mon, 3 Apr 2023 17:20:17 GMT
- Status: processing complete
- Last updated in system: 2023-04-05 00:08:40.225832
- Title: Castling-ViT: Compressing Self-Attention via Switching Towards
Linear-Angular Attention During Vision Transformer Inference
- Title (reference translation): Castling-ViT: Compressing Self-Attention by Switching to Linear-Angular Attention During Vision Transformer Inference
- Authors: Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang,
Haoqi Fan, Peter Vajda, Yingyan Lin
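- Abstract summary: Vision Transformers (ViTs) have shown excellent performance, but their computational cost is high compared with convolutional neural networks (CNNs).
- Reference score (independently calculated degree of interest): 44.913419668685066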
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) have shown impressive performance but still
require a high computation cost as compared to convolutional neural networks
(CNNs). One reason is that ViTs' attention measures global similarities and
thus has a quadratic complexity with the number of input tokens. Existing
efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g.,
Performer), which sacrifice ViTs' capabilities of capturing either global or
local context. In this work, we ask an important research question: Can ViTs
learn both global and local context while being more efficient during
inference? To this end, we propose a framework called Castling-ViT, which
trains ViTs using both linear-angular attention and masked softmax-based
quadratic attention, but then switches to having only linear angular attention
during ViT inference. Our Castling-ViT leverages angular kernels to measure the
similarities between queries and keys via spectral angles. We further
simplify it with two techniques: (1) a novel linear-angular attention
mechanism: we decompose the angular kernels into linear terms and high-order
residuals, and only keep the linear terms; and (2) we adopt two parameterized
modules to approximate high-order residuals: a depthwise convolution and an
auxiliary masked softmax attention to help learn both global and local
information, where the masks for softmax attention are regularized to gradually
become zeros and thus incur no overhead during ViT inference. Extensive
experiments and ablation studies on three tasks consistently validate the
effectiveness of the proposed Castling-ViT, e.g., achieving up to a 1.8% higher
accuracy or 40% MACs reduction on ImageNet classification and 1.2 higher mAP on
COCO detection under comparable FLOPs, as compared to ViTs with vanilla
softmax-based attentions.
- Abstract (reference translation): Vision Transformers (ViTs) have shown excellent performance but require a high computation cost compared with convolutional neural networks (CNNs).
To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention.
We further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn both global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference.
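The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the angular kernel has the form 1 - θ/π, where θ is the angle between a query and a key; that form, the absence of any normalisation, and all function names below are assumptions made for illustration and are not taken from this page. Since 1 - arccos(x)/π = 1/2 + arcsin(x)/π, the terms kept are 1/2 + x/π (x being the cosine similarity), and everything of higher order is the residual. The depthwise convolution and the auxiliary masked softmax attention that approximate the residual during training are not modelled here.

```python
import numpy as np


def _unit_rows(x):
    # Scale each row to unit length so dot products become cosines.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def linear_angular_attention(q, k, v):
    """Linear terms of the assumed angular kernel, without an N x N matrix."""
    qn, kn = _unit_rows(q), _unit_rows(k)
    # Constant term 1/2: every query receives half the sum of all values.
    const = 0.5 * v.sum(axis=0, keepdims=True)
    # Linear term x / pi: grouping as qn @ (kn.T @ v) keeps the cost linear
    # in the number of tokens N.
    return const + qn @ (kn.T @ v) / np.pi


def quadratic_reference(q, k, v):
    """Same linear terms, computed through the explicit N x N similarity matrix."""
    qn, kn = _unit_rows(q), _unit_rows(k)
    sim = 0.5 + (qn @ kn.T) / np.pi
    return sim @ v


def dropped_residual(q, k):
    """High-order residual: assumed exact angular kernel minus its linear terms."""
    cos = np.clip(_unit_rows(q) @ _unit_rows(k).T, -1.0, 1.0)
    exact = 1.0 - np.arccos(cos) / np.pi
    return exact - (0.5 + cos / np.pi)


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))  # 16 tokens, 8 channels (arbitrary sizes)
same = np.allclose(linear_angular_attention(q, k, v), quadratic_reference(q, k, v))
largest_residual = np.abs(dropped_residual(q, k)).max()
```

Under these assumptions, `same` confirms that the reordered product matches the explicit N x N computation, and the dropped residual (arcsin(x) - x)/π never exceeds 1/2 - 1/π in magnitude, which gives a rough sense of what the two auxiliary modules are left to approximate.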
- GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets [1.1586742546971471]
We propose a Graph-based Vision Transformer (GvT) that uses graph convolutional projection and graph pooling.
Paper, reference translation (metadata) (2024-04-07T11:48:07Z)
- From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot
Keypoint Detection [36.9781808268263]
Few-shot Keypoint Detection (FSKD) localizes keypoints, including novel or base keypoints, according to reference samples.
Paper, reference translation (metadata) (2023-04-06T15:22:34Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Paper, reference translation (metadata) (2022-12-23T19:13:43Z)
- ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision
Transformer Acceleration with a Linear Taylor Attention [23.874485033096917]
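Vision Transformers (ViTs) have emerged as a competitive alternative to convolutional neural networks for a variety of computer vision applications.
We therefore propose ViTALiTy, a hardware design framework for improving the inference efficiency of ViTs.
Paper, reference translation (metadata) (2022-11-09T18:58:21Z)
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers [43.48734363817069]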
We propose LightViT, a new family of lightweight ViTs, to improve the accuracy-efficiency balance on pure transformer blocks without convolution.
Paper, reference translation (metadata) (2022-07-12T14:27:57Z)
- Vicinity Vision Transformer [53.43198716947792]
Paper, reference translation (metadata) (2022-06-21T17:33:53Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
SUN (Self-promoted sUpervisioN) is a few-shot learning framework for Vision Transformers (ViTs).
Paper, reference translation (metadata) (2022-03-14T12:53:27Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain
Analysis: From Theory to Practice [111.47461527901318]
Vision Transformers (ViTs) have recently demonstrated promise in computer vision problems.
Paper, reference translation (metadata) (2022-03-09T23:55:24Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
We propose SPViT (Single-Path Vision Transformer pruning), which efficiently and automatically compresses pre-trained ViTs.
Paper, reference translation (metadata) (2021-11-23T11:35:54Z)