Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
- URL: http://arxiv.org/abs/2205.12551v3
- Date: Fri, 26 May 2023 07:42:21 GMT
- Title: Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
- Authors: Bin Ren, Yahui Liu, Yue Song, Wei Bi, Rita Cucchiara, Nicu Sebe, Wei Wang
- Abstract summary: Position Embeddings (PEs) have been shown to improve the performance of Vision Transformers (ViTs) on many vision tasks.
PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed.
We propose a Masked Jigsaw Puzzle (MJP) position embedding method to tackle these issues.
- Score: 87.0319004283766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Position Embeddings (PEs), an arguably indispensable component in Vision
Transformers (ViTs), have been shown to improve the performance of ViTs on many
vision tasks. However, PEs have a potentially high risk of privacy leakage
since the spatial information of the input patches is exposed. This caveat
naturally raises a series of interesting questions about the impact of PEs on
the accuracy, privacy, prediction consistency, etc. To tackle these issues, we
propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular,
MJP first shuffles the selected patches via our block-wise random jigsaw puzzle
shuffle algorithm, and their corresponding PEs are occluded. Meanwhile, for the
non-occluded patches, the PEs remain the original ones but their spatial
relation is strengthened via our dense absolute localization regressor. The
experimental results reveal that 1) PEs explicitly encode the 2D spatial
relationship and lead to severe privacy leakage problems under gradient
inversion attack; 2) Training ViTs with the naively shuffled patches can
alleviate the problem, but it harms the accuracy; 3) Under a certain shuffle
ratio, the proposed MJP not only boosts the performance and robustness on
large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves
the privacy preservation ability under typical gradient attacks by a large
margin. The source code and trained models are available at https://github.com/yhlleo/MJP.
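The core mechanism described in the abstract — shuffling a randomly selected subset of patches and occluding their position embeddings while leaving the rest untouched — can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function name `mjp_shuffle`, the `shuffle_ratio` default, and the zero-vector `mask_token` placeholder are all assumptions (the actual method uses a learnable shared mask embedding and a block-wise jigsaw shuffle over the 2D patch grid).

```python
import numpy as np

def mjp_shuffle(patches, pos_embed, shuffle_ratio=0.25, mask_token=None, seed=0):
    """Sketch of the MJP idea: shuffle a random subset of patch tokens and
    occlude (mask) their position embeddings; unselected patches keep
    their original PEs.

    patches:   (N, D) array of patch tokens
    pos_embed: (N, D) array of position embeddings
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    k = int(n * shuffle_ratio)
    idx = rng.choice(n, size=k, replace=False)    # patches selected for shuffling
    perm = rng.permutation(idx)                   # jigsaw-style permutation among them

    shuffled = patches.copy()
    shuffled[idx] = patches[perm]                 # shuffle only the selected patches

    pe = pos_embed.copy()
    if mask_token is None:
        # placeholder "unknown position" embedding (learnable in the real method)
        mask_token = np.zeros(pos_embed.shape[1])
    pe[idx] = mask_token                          # occlude PEs of shuffled patches
    return shuffled, pe
```

The remaining (non-occluded) PEs would then be regularized by the paper's dense absolute localization regressor, which is not sketched here.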
Related papers
- A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models [109.4033233070067]
The gradient of Position Embeddings (PEs) in a Transformer contains sufficient information to reconstruct the input data.
We introduce a Masked Jigsaw Puzzle (MJP) framework to improve Transformer models' robustness against gradient attacks.
Results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks.
arXiv Detail & Related papers (2026-01-17T13:32:32Z)
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer based methods have achieved great success in image inpainting recently.
They regard each pixel as a token, thus suffering from an information loss issue.
We propose a new transformer based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
- Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
arXiv Detail & Related papers (2023-04-04T01:51:53Z)
- AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+ [44.856035786948915]
We propose an Adversarial Positional Embedding (AdPE) approach to pretrain vision transformers.
AdPE distorts the local visual structures by perturbing the position encodings.
Experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE.
arXiv Detail & Related papers (2023-03-14T02:42:01Z)
- PMP: Privacy-Aware Matrix Profile against Sensitive Pattern Inference for Time Series [12.855499575586753]
We propose a new privacy-preserving problem: preventing malicious inference on long shape-based patterns.
We find that while the Matrix Profile (MP) can prevent concrete shape leakage, the canonical correlation in the MP index can still reveal the location of sensitive long patterns.
We propose a Privacy-Aware Matrix Profile (PMP) via perturbing the local correlation and breaking the canonical correlation in MP index vector.
arXiv Detail & Related papers (2023-01-04T22:11:38Z)
- DeViT: Deformed Vision Transformers in Video Inpainting [59.73019717323264]
First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH).
Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching.
Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens.
arXiv Detail & Related papers (2022-09-28T08:57:14Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Reduce Information Loss in Transformers for Pluralistic Image Inpainting [112.50657646357494]
We propose a new transformer based framework "PUT" to keep input information as much as possible.
PUT greatly outperforms state-of-the-art methods on image fidelity, especially for large masked regions and complex large-scale datasets.
arXiv Detail & Related papers (2022-05-10T17:59:58Z)
- ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection [15.70621878093133]
Face Presentation Attack Detection (PAD) is an important measure to prevent spoof attacks for face biometric systems.
Many works based on Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary task without considering the context.
We propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range attention, which can not only focus on local details with short-range attention within a frame but also capture long-range dependencies over frames.
arXiv Detail & Related papers (2022-03-03T08:23:20Z)
- Short Range Correlation Transformer for Occluded Person Re-Identification [4.339510167603376]
We propose a partial feature transformer-based person re-identification framework named PFT.
The proposed PFT utilizes three modules to enhance the efficiency of vision transformer.
Experimental results over occluded and holistic re-identification datasets demonstrate that the proposed PFT network achieves superior performance consistently.
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
- Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for Transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
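To make the iRPE entry above concrete, a 2D-aware relative position encoding indexes each ordered patch pair by its (Δrow, Δcol) offset rather than by a 1D distance. The sketch below is a hypothetical illustration of such an index table, assuming a simple dense bucketing; the function name and the bucketing scheme are assumptions, not iRPE's actual piecewise mapping.

```python
import numpy as np

def relative_position_index(h, w):
    """Hypothetical sketch of a 2D relative position index: each ordered
    patch pair (i, j) on an h x w grid maps to a bucket id determined by
    its (drow, dcol) offset. A model would look up a learnable bias per
    bucket and add it to the attention logits."""
    # (h*w, 2) grid coordinates of all patches, in row-major order
    coords = np.stack(
        np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1
    ).reshape(-1, 2)
    rel = coords[:, None, :] - coords[None, :, :]   # (hw, hw, 2) pairwise offsets
    rel += np.array([h - 1, w - 1])                 # shift offsets to be non-negative
    # flatten (drow, dcol) into a single bucket id in [0, (2h-1)*(2w-1))
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]
```

Because offsets of pairs (i, j) and (j, i) are negatives of each other, the table is anti-symmetric around the center bucket, which corresponds to zero offset.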
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.