Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
- URL: http://arxiv.org/abs/2506.10178v1
- Date: Wed, 11 Jun 2025 21:10:26 GMT
- Title: Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
- Authors: Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias
- Abstract summary: We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. EP generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings.
- Score: 20.39513629593113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
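The abstract describes EP only at a high level; the exact architecture lives in the linked repository. A minimal sketch of the underlying idea, assuming a handful of learned query vectors that cross-attend over frozen patch tokens through a single key projection (with no per-head value or output projections), is given below; the class name, query count, and other parameters are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn


class MultiQueryCrossAttentionProbe(nn.Module):
    """Minimal sketch of an attentive probe in the spirit of EP (hypothetical
    implementation; see the authors' repository for the exact design).

    A small set of learned queries cross-attends over frozen patch tokens.
    Keys come from a single linear projection of the tokens, and the values
    are the raw tokens themselves, so there is no value or output projection
    to train.
    """

    def __init__(self, dim: int, num_queries: int = 4, num_classes: int = 1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.key_proj = nn.Linear(dim, dim, bias=False)  # single shared projection
        self.classifier = nn.Linear(num_queries * dim, num_classes)
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) frozen features from the pre-trained backbone
        keys = self.key_proj(patch_tokens)                                   # (B, N, D)
        attn = torch.einsum("qd,bnd->bqn", self.queries, keys) * self.scale  # (B, Q, N)
        attn = attn.softmax(dim=-1)                     # one attention map per query
        pooled = torch.einsum("bqn,bnd->bqd", attn, patch_tokens)            # (B, Q, D)
        return self.classifier(pooled.flatten(1))                            # (B, classes)
```

Only the probe's parameters (queries, key projection, classifier) are trained while the backbone stays frozen, which is what keeps the trainable-parameter count and compute low relative to a full multi-head attention pooling layer.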
Related papers
- Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization [11.10178274806454]
We propose a form of weak supervision that improves annotation efficiency and detection performance. We re-annotate mainstream IML datasets with scribble labels and propose the first scribble-based IML dataset. We employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions.
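The digest mentions a structural consistency loss but not its exact form; a generic two-view consistency term of the kind such frameworks typically use might look as follows (the function and tensor shapes are assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F


def consistency_loss(pred_a: torch.Tensor, pred_b: torch.Tensor) -> torch.Tensor:
    """Generic prediction-consistency term (illustrative, not the paper's exact loss).

    pred_a, pred_b: (B, 1, H, W) manipulation-probability maps produced by the
    model for two augmented views of the same image, aligned to the same grid.
    Penalizing their disagreement pushes the model toward stable predictions in
    regions where scribble labels give no direct supervision.
    """
    return F.mse_loss(torch.sigmoid(pred_a), torch.sigmoid(pred_b))
```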
arXiv Detail & Related papers (2025-07-17T11:45:27Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing [14.114970711442512]
This paper introduces Attention Pruning, a fairness-aware simulated annealing approach to prune attention heads in large language models (LLMs). Our experiments show that Attention Pruning achieves up to a 40% reduction in gender bias and outperforms state-of-the-art bias mitigation strategies.
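The digest names the search procedure (simulated annealing over attention heads) but not its objective or schedule; a generic version of that search, with the surrogate fairness/accuracy objective left as a user-supplied callable, could look like this (all hyperparameters are illustrative):

```python
import math
import random


def anneal_head_mask(num_heads, objective, steps=500, t0=1.0, cooling=0.995):
    """Generic simulated-annealing search over a binary keep/prune mask for
    attention heads (illustrative; the paper's surrogate objective and schedule
    are not reproduced here).

    `objective(mask)` should return a scalar combining task performance and a
    bias penalty measured with the masked model; lower is better.
    """
    mask = [1] * num_heads                                # start with all heads kept
    best_mask, best_cost = list(mask), objective(mask)
    cost, t = best_cost, t0
    for _ in range(steps):
        candidate = list(mask)
        candidate[random.randrange(num_heads)] ^= 1       # flip one head on/off
        c = objective(candidate)
        # accept improvements always, worse moves with Boltzmann probability
        if c < cost or random.random() < math.exp((cost - c) / max(t, 1e-8)):
            mask, cost = candidate, c
            if cost < best_cost:
                best_mask, best_cost = list(mask), cost
        t *= cooling                                      # geometric cooling schedule
    return best_mask
```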
arXiv Detail & Related papers (2025-03-20T03:02:32Z)
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method. It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z)
- PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity [9.092404060771306]
Diffusion models have shown impressive results in generating high-quality conditional samples. However, existing methods often require additional training or neural function evaluations (NFEs). We propose a novel and efficient method, termed PLADIS, which boosts pre-trained models by leveraging sparse attention.
arXiv Detail & Related papers (2025-03-10T07:23:19Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
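The core idea, as summarized, is to fine-tune on only the first few tokens of a model's own responses; a minimal loss that restricts supervision to such a prefix might look as follows (UPFT's sampling recipe and prefix length are not given in the digest, so both are assumptions):

```python
import torch
import torch.nn.functional as F


def prefix_only_loss(logits: torch.Tensor, targets: torch.Tensor, prefix_len: int = 8) -> torch.Tensor:
    """Cross-entropy restricted to the first `prefix_len` response tokens
    (illustrative of prefix-style fine-tuning, not UPFT's exact recipe).

    logits:  (B, T, V) next-token logits over the model's own sampled response
    targets: (B, T)    the corresponding response token ids
    """
    logits = logits[:, :prefix_len, :]        # supervise only the prefix positions
    targets = targets[:, :prefix_len]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```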
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans [13.695885742446027]
Self-attention can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. We introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport. Our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency.
arXiv Detail & Related papers (2025-02-11T21:20:48Z)
- Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More [26.226145789963443]
Mask-Enhanced Autoregressive Prediction (MEAP) is a training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP). In intensive experiments, MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens.
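A minimal sketch of the summarized idea, i.e. corrupting a fraction of input tokens with a mask token while keeping the ordinary next-token targets, is shown below; the masking ratio and the assumption that `model` returns raw logits are illustrative rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F


def meap_style_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """One training step in the spirit of MEAP (illustrative): randomly mask a
    fraction of *input* tokens while keeping the standard next-token-prediction
    targets unchanged.
    """
    corrupted = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    corrupted[mask] = mask_token_id                        # hide some inputs
    logits = model(corrupted)                              # assumed shape (B, T, V)
    # usual causal LM shift: predict token t+1 from positions <= t
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),                      # targets are the clean ids
    )
    return loss
```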
arXiv Detail & Related papers (2025-02-11T11:49:03Z)
- Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation [49.827306773992376]
Continual Test-Time Adaptation (CTTA) is proposed to migrate a source pre-trained model to continually changing target distributions.
Our proposed method attains state-of-the-art performance in both classification and segmentation CTTA tasks.
arXiv Detail & Related papers (2023-12-19T15:34:52Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
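The digest does not specify UPET's uncertainty estimator; a generic uncertainty-gated pseudo-labelling step, here using Monte-Carlo dropout over a classifier, conveys the general idea (the thresholds, number of samples, and classification setting are assumptions):

```python
import torch


@torch.no_grad()
def select_pseudo_labels(model, x, n_samples=8, max_uncertainty=0.1):
    """Illustrative uncertainty-gated pseudo-labelling (not UPET's exact estimator):
    run several stochastic forward passes with dropout active and keep only the
    examples whose predictions are stable across passes.
    """
    model.train()                                    # keep dropout active for MC sampling
    probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])  # (S, B, C)
    mean_probs = probs.mean(0)                       # (B, C) averaged prediction
    uncertainty = probs.var(0).sum(-1)               # predictive variance per example
    keep = uncertainty < max_uncertainty             # gate unreliable pseudo-labels out
    pseudo_labels = mean_probs.argmax(-1)
    return x[keep], pseudo_labels[keep]
```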
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT). MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
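The summary does not define the likelihood-ratio class; a generic min-max DRO step with a small parametric adversary that reweights per-example losses illustrates the setup (the normalization and update scheme here are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F


def dro_losses(model, adversary, x, y):
    """Generic min-max DRO step with a parametric likelihood-ratio adversary
    (an illustration of the setup, not the paper's exact ratio class or updates).

    The adversary outputs a log-ratio per example; normalizing over the batch
    gives a reweighting that up-weights hard examples. The model minimizes the
    reweighted loss while the adversary is updated to maximize it.
    """
    per_example = F.cross_entropy(model(x), y, reduction="none")       # (B,)
    ratios = torch.softmax(adversary(x).squeeze(-1), dim=0) * len(y)   # mean weight ~ 1
    model_loss = (ratios.detach() * per_example).mean()                # step the model on this
    adversary_loss = -(ratios * per_example.detach()).mean()           # step the adversary on this
    return model_loss, adversary_loss
```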
arXiv Detail & Related papers (2022-04-13T12:43:12Z)