Attention Deficit is Ordered! Fooling Deformable Vision Transformers
with Collaborative Adversarial Patches
- URL: http://arxiv.org/abs/2311.12914v2
- Date: Mon, 25 Dec 2023 08:35:41 GMT
- Title: Attention Deficit is Ordered! Fooling Deformable Vision Transformers
with Collaborative Adversarial Patches
- Authors: Quazi Mishkatul Alam, Bilel Tarchoun, Ihsen Alouani, Nael Abu-Ghazaleh
- Abstract summary: Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
- Score: 3.4673556247932225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The latest generation of transformer-based vision models has proven to be
superior to Convolutional Neural Network (CNN)-based models across several
vision tasks, largely attributed to their remarkable prowess in relation
modeling. Deformable vision transformers significantly reduce the quadratic
complexity of attention modeling by using sparse attention structures, enabling
them to incorporate features across different scales and be used in large-scale
applications, such as multi-view vision systems. Recent work has demonstrated
adversarial attacks against conventional vision transformers; we show that
these attacks do not transfer to deformable transformers due to their sparse
attention structure. Specifically, attention in deformable transformers is
modeled using pointers to the most relevant other tokens. In this work, we
contribute for the first time adversarial attacks that manipulate the attention
of deformable transformers, redirecting it to focus on irrelevant parts of the
image. We also develop new collaborative attacks where a source patch
manipulates attention to point to a target patch, which contains the
adversarial noise to fool the model. In our experiments, we observe that
altering less than 1% of the patched area in the input field results in a
complete drop to 0% AP in single-view object detection using MS COCO and a 0%
MODA in multi-view object detection using Wildtrack.
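For context on the sparse, pointer-like attention that the attack targets, below is a minimal single-scale PyTorch sketch of deformable attention in the style of Deformable DETR. The module, its offset and weight heads, and the number of sampling points are illustrative assumptions rather than the paper's implementation; the comments indicate where an adversarial source patch would have to act (the input-dependent sampling offsets) to redirect attention toward a target patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention1Scale(nn.Module):
    """Minimal, single-scale deformable attention sketch (Deformable DETR style).

    Each query attends to only `n_points` sampled locations ("pointers") instead
    of all spatial positions, which is the sparse structure the paper attacks."""

    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        # Predict per-query sampling offsets and attention weights.
        self.offset_head = nn.Linear(d_model, n_points * 2)   # (dx, dy) per point
        self.weight_head = nn.Linear(d_model, n_points)       # one weight per point
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (B, Nq, C)   query embeddings
        # ref_points: (B, Nq, 2)   reference locations in [0, 1] x [0, 1]
        # feat_map:   (B, C, H, W) value feature map
        B, Nq, C = queries.shape
        values = self.value_proj(feat_map.flatten(2).transpose(1, 2))        # (B, HW, C)
        values = values.transpose(1, 2).reshape(B, C, *feat_map.shape[-2:])  # (B, C, H, W)

        # Sampling locations = reference point + learned, input-dependent offset.
        # An adversarial source patch perturbs the input, which shifts these
        # predicted offsets; the collaborative attack steers them toward a
        # target patch that carries the fooling noise.
        offsets = self.offset_head(queries).view(B, Nq, self.n_points, 2)
        locs = ref_points[:, :, None, :] + offsets             # (B, Nq, P, 2) in [0, 1]
        grid = locs * 2.0 - 1.0                                 # grid_sample expects [-1, 1]

        sampled = F.grid_sample(values, grid, mode="bilinear",
                                align_corners=False)            # (B, C, Nq, P)
        weights = self.weight_head(queries).softmax(dim=-1)     # (B, Nq, P)
        out = (sampled * weights[:, None, :, :]).sum(dim=-1)    # (B, C, Nq)
        return self.out_proj(out.transpose(1, 2))               # (B, Nq, C)

# Smoke test with random tensors.
attn = DeformableAttention1Scale()
q = torch.randn(2, 10, 256)
refs = torch.rand(2, 10, 2)
feats = torch.randn(2, 256, 32, 32)
print(attn(q, refs, feats).shape)  # torch.Size([2, 10, 256])
```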
Related papers
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
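To make the multi-scale token idea concrete, here is a rough PyTorch sketch that pools a single-scale token sequence into coarser tokens and appends them to the key/value set; the pooling factors, dimensions, and single attention layer are assumptions for illustration, not the paper's MS-A (and Local-A is not shown).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTokenAttention(nn.Module):
    """Illustrative multi-scale attention: coarse tokens are pooled from the
    single-scale input and appended to the key/value set (assumption, not MS-A)."""

    def __init__(self, dim=128, num_heads=4, pool_factors=(2, 4)):
        super().__init__()
        self.pool_factors = pool_factors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N, C) single-scale point/patch features
        multi_scale = [tokens]
        for f in self.pool_factors:
            # Average-pool along the token dimension to form coarser tokens.
            coarse = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=f, stride=f)
            multi_scale.append(coarse.transpose(1, 2))
        kv = torch.cat(multi_scale, dim=1)          # fine + coarse keys/values
        out, _ = self.attn(tokens, kv, kv)          # queries stay at the fine scale
        return out

x = torch.randn(2, 64, 128)
print(MultiScaleTokenAttention()(x).shape)  # torch.Size([2, 64, 128])
```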
arXiv Detail & Related papers (2023-01-06T18:52:12Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to their insightful architectural design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate this global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
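A minimal sketch in the spirit of a query-irrelevant global context folded into convolutional features is given below; the block structure, the 1x1-convolution context head, and the additive fusion are assumptions, not FCViT's actual design.

```python
import torch
import torch.nn as nn

class GlobalContextConv(nn.Module):
    """Illustrative block: a query-irrelevant global context (one spatial
    softmax shared by all positions) is added to a convolution's output."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.ctx_logits = nn.Conv2d(channels, 1, kernel_size=1)  # one map, no queries
        self.ctx_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        logits = self.ctx_logits(x).view(B, 1, H * W).softmax(dim=-1)      # (B, 1, HW)
        context = torch.bmm(x.view(B, C, H * W), logits.transpose(1, 2))   # (B, C, 1)
        context = self.ctx_proj(context.view(B, C, 1, 1))                  # (B, C, 1, 1)
        return self.conv(x) + context   # broadcast the same context everywhere

print(GlobalContextConv()(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```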
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias into vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
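Vicinity Attention belongs to the linear-complexity attention family. The sketch below shows generic kernel-based linear attention (Katharopoulos et al. style) with O(N) cost; it does not reproduce the paper's locality bias, which is its distinguishing contribution.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic kernel-based linear attention, O(N) in sequence length.
    q, k, v: (B, N, C). The feature map elu(x)+1 keeps scores positive."""
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnc,bnd->bcd", k, v)                         # (B, C, D) key/value summary
    z = 1.0 / (torch.einsum("bnc,bc->bn", q, k.sum(dim=1)) + eps)   # per-query normalizer
    return torch.einsum("bnc,bcd,bn->bnd", q, kv, z)                # (B, N, D)

q = k = v = torch.randn(2, 100, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 100, 32])
```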
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Visual Transformer for Object Detection [0.0]
We consider the use of self-attention for discriminative visual tasks, specifically object detection, as an alternative to convolutions.
Our model leads to consistent improvements in object detection on COCO across many different models and scales.
arXiv Detail & Related papers (2022-06-01T06:13:09Z) - An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z) - Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
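A minimal sketch of data-dependent key/value sampling is shown below: offsets predicted from the input shift a reference grid, features are gathered at the shifted positions, and ordinary softmax attention runs over the sampled tokens. The offset network, grid size, and single-head attention are simplifications and assumptions, not the DAT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DataDependentSampling(nn.Module):
    """Keys/values are gathered at offsets predicted from the input itself,
    then attended to with ordinary softmax attention (illustrative only)."""

    def __init__(self, dim=64, grid=8):
        super().__init__()
        self.grid = grid
        self.offset_net = nn.Conv2d(dim, 2, kernel_size=3, padding=1)  # (dx, dy)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Reference grid of sampling points, predicted from a downsampled input.
        base = F.adaptive_avg_pool2d(x, self.grid)                   # (B, C, g, g)
        offsets = torch.tanh(self.offset_net(base))                  # (B, 2, g, g) in [-1, 1]
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, self.grid),
                                torch.linspace(-1, 1, self.grid), indexing="ij")
        ref = torch.stack([xs, ys], dim=-1).to(x)                    # (g, g, 2), x-then-y order
        grid = (ref[None] + offsets.permute(0, 2, 3, 1)).clamp(-1, 1)  # (B, g, g, 2)
        sampled = F.grid_sample(x, grid, align_corners=False)        # (B, C, g, g)

        q = self.q(x.flatten(2).transpose(1, 2))                     # (B, HW, C)
        kv_tokens = sampled.flatten(2).transpose(1, 2)               # (B, g*g, C)
        k, v = self.k(kv_tokens), self.v(kv_tokens)
        attn = (q @ k.transpose(1, 2) / C ** 0.5).softmax(dim=-1)    # (B, HW, g*g)
        return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

print(DataDependentSampling()(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```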
arXiv Detail & Related papers (2022-01-03T08:29:01Z) - Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to represent high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue.
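As a generic illustration of aliasing reduction (not the paper's ARM), the sketch below blurs a feature map with a fixed low-pass binomial kernel before stride-2 subsampling, the classic remedy when discrete sampling cannot represent high-frequency content.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowPassDownsample(nn.Module):
    """Blur-then-downsample: a fixed 3x3 binomial kernel acts as a low-pass
    filter so that stride-2 sampling introduces fewer aliasing artifacts.
    This is a generic anti-aliasing baseline, not the ARM from the paper."""

    def __init__(self, channels):
        super().__init__()
        k = torch.tensor([1., 2., 1.])
        kernel = (k[:, None] * k[None, :]) / 16.0                 # 3x3 binomial, sums to 1
        self.register_buffer("kernel", kernel.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):
        # x: (B, C, H, W); depthwise (groups=C) blur, then stride-2 subsampling.
        x = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        return x[:, :, ::2, ::2]

print(LowPassDownsample(32)(torch.randn(2, 32, 16, 16)).shape)  # torch.Size([2, 32, 8, 8])
```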
arXiv Detail & Related papers (2021-10-28T14:30:02Z) - CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [37.39327010226153]
We propose a Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA).
CEL blends each embedding with multiple patches of different scales, providing the model with cross-scale embeddings.
LSDA splits the self-attention module into a short-distance and a long-distance one, lowering the cost while keeping both small-scale and large-scale features in the embeddings.
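The cross-scale embedding idea can be sketched directly: patches of several sizes are embedded at the same stride and concatenated per position. The kernel sizes, channel split, and stride below are illustrative assumptions, not CrossFormer's exact configuration, and LSDA is not shown.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Illustrative CEL-style embedding: each output token mixes patches of
    several sizes centered on the same location (concatenated over channels)."""

    def __init__(self, in_ch=3, dim=96, kernel_sizes=(4, 8, 16), stride=4):
        super().__init__()
        split = dim // len(kernel_sizes)
        self.projs = nn.ModuleList([
            nn.Conv2d(in_ch, split, kernel_size=k, stride=stride,
                      padding=(k - stride) // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):
        # x: (B, 3, H, W) -> (B, dim, H/stride, W/stride)
        return torch.cat([proj(x) for proj in self.projs], dim=1)

print(CrossScaleEmbedding()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 96, 56, 56])
```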
arXiv Detail & Related papers (2021-07-31T05:52:21Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions when processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
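One piece of the "almost equivalent mathematical expressions" claim can be checked numerically: a 1x1 convolution over a grid of patch tokens computes the same map as a fully-connected layer applied to each token. The check below is a generic demonstration, not code from the paper.

```python
import torch
import torch.nn as nn

# A per-token fully-connected layer and a 1x1 convolution with shared weights.
C_in, C_out, H, W = 32, 64, 14, 14
fc = nn.Linear(C_in, C_out)
conv = nn.Conv2d(C_in, C_out, kernel_size=1)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(C_out, C_in, 1, 1))
    conv.bias.copy_(fc.bias)

tokens = torch.randn(2, H * W, C_in)                       # (B, N, C) patch sequence
grid = tokens.transpose(1, 2).reshape(2, C_in, H, W)       # same data as (B, C, H, W)

out_fc = fc(tokens)                                        # (B, N, C_out)
out_conv = conv(grid).flatten(2).transpose(1, 2)           # back to (B, N, C_out)
print(torch.allclose(out_fc, out_conv, atol=1e-5))         # True
```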
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can serve as the backbone for a common detection task head, producing competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
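The pattern described here, a ViT backbone feeding a conventional detection head, boils down to reshaping patch tokens back into a 2D feature map. The sketch below is a toy illustration with a small transformer encoder and a dummy per-location head; none of the names or sizes come from ViT-FRCNN.

```python
import torch
import torch.nn as nn

class TinyViTBackboneDet(nn.Module):
    """Toy illustration: transformer-encoded patch tokens are reshaped into a
    (B, C, H/16, W/16) map that a conventional detection head can consume."""

    def __init__(self, dim=192, patch=16, num_classes=91):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Dummy dense head: class logits and 4 box deltas per feature location.
        self.head = nn.Conv2d(dim, num_classes + 4, kernel_size=1)

    def forward(self, images):
        # images: (B, 3, H, W), H and W divisible by the patch size
        feats = self.embed(images)                               # (B, C, H/16, W/16)
        B, C, Hp, Wp = feats.shape
        tokens = self.encoder(feats.flatten(2).transpose(1, 2))  # (B, N, C)
        fmap = tokens.transpose(1, 2).reshape(B, C, Hp, Wp)      # back to a 2D map
        return self.head(fmap)                                   # (B, num_classes + 4, Hp, Wp)

print(TinyViTBackboneDet()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 95, 14, 14])
```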
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.