A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models
- URL: http://arxiv.org/abs/2601.12051v1
- Date: Sat, 17 Jan 2026 13:32:32 GMT
- Title: A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models
- Authors: Weixin Ye, Wei Wang, Yahui Liu, Yue Song, Bin Ren, Wei Bi, Rita Cucchiara, Nicu Sebe,
- Abstract summary: The gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. We introduce a Masked Jigsaw Puzzle (MJP) framework to improve Transformer models' robustness against gradient attacks. Results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks.
- Score: 109.4033233070067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable \textit{unknown (unk)} position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models' robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (\textit{e.g.,} ImageNet-1K) and sentiment analysis for text (\textit{e.g.,} Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks. Code is publicly available via https://github.com/ywxsuperstar/transformerattack
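The abstract describes MJP's two steps: shuffle a random subset of tokens to break the token order, then replace the position embeddings (PEs) of the shuffled tokens with a shared learnable unknown (unk) embedding. A minimal PyTorch sketch of that idea follows; the function name `mjp`, the `mask_ratio` parameter, and the tensor shapes are illustrative assumptions, not taken from the paper's released code.

```python
import torch

def mjp(tokens, pos_embed, unk_pos, mask_ratio=0.5):
    """Masked Jigsaw Puzzle sketch.

    tokens:    (B, N, D) patch/token features
    pos_embed: (N, D) position embeddings
    unk_pos:   (D,) shared "unknown" position embedding
    """
    B, N, D = tokens.shape
    n_mask = int(N * mask_ratio)
    out_tokens = tokens.clone()
    pe = pos_embed.unsqueeze(0).expand(B, N, D).clone()
    for b in range(B):
        idx = torch.randperm(N)[:n_mask]       # tokens selected for shuffling
        perm = idx[torch.randperm(n_mask)]     # a random reordering of them
        out_tokens[b, idx] = tokens[b, perm]   # break the token order
        pe[b, idx] = unk_pos                   # mask out their PEs with unk
    return out_tokens + pe

# Usage sketch
B, N, D = 2, 16, 8
x = torch.randn(B, N, D)
pos = torch.randn(N, D)
unk = torch.zeros(D)  # in practice a learnable nn.Parameter
y = mjp(x, pos, unk)
```

Unselected tokens keep their original order and PEs, so only the local spatial information of the shuffled subset is disrupted, matching the abstract's description of "careful use of MJP".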
Related papers
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder [5.57393627015653]
Video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. We propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2.
arXiv Detail & Related papers (2025-06-28T13:30:36Z) - MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z) - LAIP: Learning Local Alignment from Image-Phrase Modeling for Text-based Person Search [16.7500024682162]
This paper proposes the Local Alignment from Image-Phrase modeling (LAIP) framework, with Bidirectional Attention-weighted local alignment (BidirAtt) and Mask Phrase Modeling (MPM) module.
Experiments conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate the superiority of the LAIP framework over existing methods.
arXiv Detail & Related papers (2024-06-16T08:37:24Z) - Stochastic positional embeddings improve masked image modeling [95.03491875332034]
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images.
We propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP).
StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties.
arXiv Detail & Related papers (2023-07-31T17:59:08Z) - LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding [7.7514466231699455]
This paper proposes a novel multi-modal pre-training model, LayoutMask.
It can enhance the interactions between text and layout modalities in a unified model.
It can achieve state-of-the-art results on a wide variety of VrDU problems.
arXiv Detail & Related papers (2023-05-30T03:56:07Z) - Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z) - Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel, but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z) - Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring dependencies of object-to-object, object-to-patch and patch-to-patch.
arXiv Detail & Related papers (2022-06-02T08:34:25Z) - Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers [87.0319004283766]
Position Embeddings (PEs) have been shown to improve the performance of Vision Transformers (ViTs) on many vision tasks.
PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed.
We propose a Masked Jigsaw Puzzle (MJP) position embedding method to tackle these issues.
arXiv Detail & Related papers (2022-05-25T07:56:18Z) - Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to learn by recovering missing contents.
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z) - AAformer: Auto-Aligned Transformer for Person Re-Identification [82.45385078624301]
We introduce an alignment scheme in transformer architecture for the first time.
We propose the auto-aligned transformer (AAformer) to automatically locate both human parts and non-human ones at patch level.
AAformer integrates the part alignment into the self-attention and the output [PART]s can be directly used as part features for retrieval.
arXiv Detail & Related papers (2021-04-02T08:00:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.