Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
- URL: http://arxiv.org/abs/2205.12551v3
- Date: Fri, 26 May 2023 07:42:21 GMT
- Title: Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
- Authors: Bin Ren, Yahui Liu, Yue Song, Wei Bi, Rita Cucchiara, Nicu Sebe, Wei Wang
- Abstract summary: Position Embeddings (PEs) have been shown to improve the performance of Vision Transformers (ViTs) on many vision tasks.
PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed.
We propose a Masked Jigsaw Puzzle (MJP) position embedding method to tackle these issues.
- Score: 87.0319004283766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Position Embeddings (PEs), an arguably indispensable component in Vision
Transformers (ViTs), have been shown to improve the performance of ViTs on many
vision tasks. However, PEs have a potentially high risk of privacy leakage
since the spatial information of the input patches is exposed. This caveat
naturally raises a series of interesting questions about the impact of PEs on
the accuracy, privacy, prediction consistency, etc. To tackle these issues, we
propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular,
MJP first shuffles the selected patches via our block-wise random jigsaw puzzle
shuffle algorithm, and their corresponding PEs are occluded. Meanwhile, for the
non-occluded patches, the PEs remain the original ones but their spatial
relation is strengthened via our dense absolute localization regressor. The
experimental results reveal that 1) PEs explicitly encode the 2D spatial
relationship and lead to severe privacy leakage problems under gradient
inversion attack; 2) Training ViTs with the naively shuffled patches can
alleviate the problem, but it harms the accuracy; 3) Under a certain shuffle
ratio, the proposed MJP not only boosts the performance and robustness on
large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves
the privacy preservation ability under typical gradient attacks by a large
margin. The source code and trained models are available at https://github.com/yhlleo/MJP.
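The core mechanism described in the abstract — shuffling a randomly selected subset of patches and occluding their position embeddings while leaving the rest untouched — can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function name `mjp_shuffle`, the `shuffle_ratio` default, and the zero-vector `mask_token` placeholder are all assumptions (the actual method uses a learnable shared mask embedding and a block-wise jigsaw shuffle over the 2D patch grid).

```python
import numpy as np

def mjp_shuffle(patches, pos_embed, shuffle_ratio=0.25, mask_token=None, seed=0):
    """Sketch of the MJP idea: shuffle a random subset of patch tokens and
    occlude (mask) their position embeddings; unselected patches keep
    their original PEs.

    patches:   (N, D) array of patch tokens
    pos_embed: (N, D) array of position embeddings
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    k = int(n * shuffle_ratio)
    idx = rng.choice(n, size=k, replace=False)    # patches selected for shuffling
    perm = rng.permutation(idx)                   # jigsaw-style permutation among them

    shuffled = patches.copy()
    shuffled[idx] = patches[perm]                 # shuffle only the selected patches

    pe = pos_embed.copy()
    if mask_token is None:
        # placeholder "unknown position" embedding (learnable in the real method)
        mask_token = np.zeros(pos_embed.shape[1])
    pe[idx] = mask_token                          # occlude PEs of shuffled patches
    return shuffled, pe
```

The remaining (non-occluded) PEs would then be regularized by the paper's dense absolute localization regressor, which is not sketched here.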
Related papers
- A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models [109.4033233070067]
The gradient of Position Embeddings (PEs) in a Transformer contains sufficient information to reconstruct the input data.
We introduce a Masked Jigsaw Puzzle (MJP) framework to improve Transformer models' robustness against gradient attacks.
Results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks.
arXiv Detail & Related papers (2026-01-17T13:32:32Z)
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer based methods have achieved great success in image inpainting recently.
They regard each pixel as a token, thus suffering from an information loss issue.
We propose a new transformer based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
- Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
arXiv Detail & Related papers (2023-04-04T01:51:53Z)
- AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+ [44.856035786948915]
We propose an Adversarial Positional Embedding (AdPE) approach to pretrain vision transformers.
AdPE distorts the local visual structures by perturbing the position encodings.
Experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE.
arXiv Detail & Related papers (2023-03-14T02:42:01Z)
- PMP: Privacy-Aware Matrix Profile against Sensitive Pattern Inference for Time Series [12.855499575586753]
We propose a new privacy-preserving problem: preventing malicious inference on long shape-based patterns.
We find that while the Matrix Profile (MP) can prevent concrete shape leakage, the canonical correlation in the MP index can still reveal the location of sensitive long patterns.
We propose a Privacy-Aware Matrix Profile (PMP) via perturbing the local correlation and breaking the canonical correlation in MP index vector.
arXiv Detail & Related papers (2023-01-04T22:11:38Z)
- DeViT: Deformed Vision Transformers in Video Inpainting [59.73019717323264]
First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH).
Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching.
Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens.
arXiv Detail & Related papers (2022-09-28T08:57:14Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Reduce Information Loss in Transformers for Pluralistic Image Inpainting [112.50657646357494]
We propose a new transformer based framework "PUT" to keep input information as much as possible.
PUT greatly outperforms state-of-the-art methods on image fidelity, especially for large masked regions and complex large-scale datasets.
arXiv Detail & Related papers (2022-05-10T17:59:58Z)
- ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection [15.70621878093133]
Face Presentation Attack Detection (PAD) is an important measure to prevent spoof attacks for face biometric systems.
Many works based on Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary task without considering the context.
We propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range attention, which can not only focus on local details with short-range attention within a frame but also capture long-range dependencies over frames.
arXiv Detail & Related papers (2022-03-03T08:23:20Z)
- Short Range Correlation Transformer for Occluded Person Re-Identification [4.339510167603376]
We propose a partial feature transformer-based person re-identification framework named PFT.
The proposed PFT utilizes three modules to enhance the efficiency of vision transformer.
Experimental results over occluded and holistic re-identification datasets demonstrate that the proposed PFT network achieves superior performance consistently.
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
- Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for Transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
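To make the iRPE entry above concrete, a 2D-aware relative position encoding indexes each ordered patch pair by its (Δrow, Δcol) offset rather than by a 1D distance. The sketch below is a hypothetical illustration of such an index table, assuming a simple dense bucketing; the function name and the bucketing scheme are assumptions, not iRPE's actual piecewise mapping.

```python
import numpy as np

def relative_position_index(h, w):
    """Hypothetical sketch of a 2D relative position index: each ordered
    patch pair (i, j) on an h x w grid maps to a bucket id determined by
    its (drow, dcol) offset. A model would look up a learnable bias per
    bucket and add it to the attention logits."""
    # (h*w, 2) grid coordinates of all patches, in row-major order
    coords = np.stack(
        np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1
    ).reshape(-1, 2)
    rel = coords[:, None, :] - coords[None, :, :]   # (hw, hw, 2) pairwise offsets
    rel += np.array([h - 1, w - 1])                 # shift offsets to be non-negative
    # flatten (drow, dcol) into a single bucket id in [0, (2h-1)*(2w-1))
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]
```

Because offsets of pairs (i, j) and (j, i) are negatives of each other, the table is anti-symmetric around the center bucket, which corresponds to zero offset.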
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.