Layer by layer, module by module: Choose both for optimal OOD probing of ViT
- URL: http://arxiv.org/abs/2603.05280v1
- Date: Thu, 05 Mar 2026 15:23:41 GMT
- Title: Layer by layer, module by module: Choose both for optimal OOD probing of ViT
- Authors: Ambroise Odonnat, Vasilii Feofanov, Laetitia Chapel, Romain Tavenard, Ievgen Redko,
- Abstract summary: We study the behavior of intermediate layers in pretrained vision transformers. We find that distribution shift between pretraining and downstream data is the primary cause of performance degradation.
- Score: 16.482899285404145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
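The module-level finding lends itself to a quick experiment. Below is a minimal sketch of module-level linear probing with forward hooks, assuming a timm ViT whose blocks expose `mlp.act` (the activation inside the feedforward network) and `attn` submodules; the hook points, token pooling, and probe here are illustrative stand-ins, not the paper's exact protocol.

```python
# Hedged sketch: linear probing of intermediate ViT modules via forward hooks.
# Assumes a timm "vit_base_patch16_224" whose blocks expose `mlp.act` and `attn`;
# the exact modules probed in the paper may differ.
import timm
import torch
from sklearn.linear_model import LogisticRegression

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Mean-pool over tokens to get one feature vector per image.
        captured[name] = output.detach().mean(dim=1)
    return hook

hooks = []
for i, block in enumerate(model.blocks):
    hooks.append(block.mlp.act.register_forward_hook(make_hook(f"block{i}.mlp.act")))
    hooks.append(block.attn.register_forward_hook(make_hook(f"block{i}.attn")))

images = torch.randn(8, 3, 224, 224)        # stand-in for a real probing set
labels = (torch.arange(8) % 2).numpy()      # stand-in labels
with torch.no_grad():
    model(images)

# Fit one linear probe per captured module (toy data, for illustration only).
for name, feats in captured.items():
    probe = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)
    print(name, probe.score(feats.numpy(), labels))

for h in hooks:
    h.remove()
```

In a real probing run, the features of every module would be collected over the full downstream training split and each probe evaluated on a held-out split, so the per-module accuracies can be compared under weak vs. strong distribution shift.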
Related papers
- Out-of-distribution transfer of PDE foundation models to material dynamics under extreme loading [86.6550968435969]
Most PDE foundation models are pretrained and fine-tuned on fluid-centric benchmarks. We benchmark out-of-distribution transfer on two discontinuity-dominated regimes in which shocks, evolving interfaces, and fracture produce highly non-smooth fields. We evaluate two open-source PDE foundation models, POSEIDON and MORPH, and compare fine-tuning from pretrained weights against training from scratch across training-set sizes to quantify sample efficiency under distribution shift.
arXiv Detail & Related papers (2026-03-04T18:19:35Z) - Beyond the final layer: Attentive multilayer fusion for vision transformers [45.627646781613386]
We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in the final layers. We apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions.
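A minimal sketch of the general idea of attentive layer fusion is given below: a learned query attends over per-layer [CLS] features so a probe can weight layers by task relevance. The module names and dimensions are illustrative assumptions, not the authors' exact head.

```python
# Hedged sketch: a learned query attends over per-layer [CLS] features so the
# probe can weight layers by task relevance. Illustrative, not the paper's head.
import torch
import torch.nn as nn

class AttentiveLayerFusion(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, layer_feats):                   # (B, num_layers, dim)
        q = self.query.expand(layer_feats.size(0), -1, -1)
        fused, weights = self.attn(q, layer_feats, layer_feats)
        return self.head(fused.squeeze(1)), weights   # logits, per-layer attention

# Toy usage: 12 layers of 768-d [CLS] tokens from a frozen ViT.
feats = torch.randn(4, 12, 768)
logits, layer_weights = AttentiveLayerFusion(768, num_classes=10)(feats)
print(logits.shape, layer_weights.shape)   # torch.Size([4, 10]) torch.Size([4, 1, 12])
```

The attention weights over layers double as a diagnostic of which depths the fused probe relies on for a given task.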
arXiv Detail & Related papers (2026-01-14T09:50:09Z) - ECG-Soup: Harnessing Multi-Layer Synergy for ECG Foundation Models [17.400439953606913]
Transformer-based foundation models for Electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications. ECGs are used in the diagnosis and treatment of heart disease.
arXiv Detail & Related papers (2025-08-27T20:30:03Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
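As a toy illustration of the in-context linear-regression setup (not the paper's exact parameterization or training regime), the sketch below builds prompts of (x, y) pairs plus a query token and trains a single multi-head softmax-attention layer to predict the query's label.

```python
# Hedged sketch of in-context linear regression: each prompt holds (x_i, y_i)
# pairs plus a query x, and one multi-head softmax-attention layer is trained
# to output the query's label. Toy setup, illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_ctx, batch = 8, 16, 64

def sample_prompts():
    w = torch.randn(batch, d, 1)                        # a fresh linear task per prompt
    x = torch.randn(batch, n_ctx + 1, d)
    y = (x @ w).squeeze(-1)                             # noiseless linear labels
    tokens = torch.cat([x, y.unsqueeze(-1)], dim=-1)    # (B, n_ctx+1, d+1)
    tokens[:, -1, -1] = 0.0                             # hide the query's label
    return tokens, y[:, -1]

attn = nn.MultiheadAttention(d + 1, num_heads=3, batch_first=True)
readout = nn.Linear(d + 1, 1)
opt = torch.optim.Adam(list(attn.parameters()) + list(readout.parameters()), lr=1e-3)

for step in range(200):
    tokens, target = sample_prompts()
    out, _ = attn(tokens, tokens, tokens)
    pred = readout(out[:, -1]).squeeze(-1)               # prediction at the query position
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final MSE: {loss.item():.3f}")
```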
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Directional Gradient Projection for Robust Fine-Tuning of Foundation Models [25.04763038570959]
Directional Gradient Projection (DiGraP) is a layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. We first bridge the uni-modal and multi-modal gap by performing analysis on Image Classification reformulated Visual Question Answering (VQA) benchmarks. Experimental results show that DiGraP consistently outperforms existing baselines across Image Classification and VQA tasks with discriminative and generative backbones.
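The sketch below illustrates the generic idea of directional gradient projection: remove the component of the task gradient that conflicts with a regularization direction (here, staying close to the pretrained weights). It is an assumption-laden illustration, not DiGraP's actual update rule.

```python
# Hedged sketch of directional gradient projection: if the task gradient
# conflicts with the regularization direction (pulling back toward pretrained
# weights), drop the conflicting component. Not DiGraP's exact rule.
import torch

def project_gradient(task_grad, reg_grad):
    """Project task_grad so it has no negative component along reg_grad."""
    dot = torch.dot(task_grad.flatten(), reg_grad.flatten())
    if dot < 0:  # gradients conflict
        task_grad = task_grad - dot / reg_grad.norm().pow(2) * reg_grad
    return task_grad

# Toy usage on a single parameter tensor.
pretrained_w = torch.randn(10, 10)
w = (pretrained_w + 0.1 * torch.randn(10, 10)).requires_grad_()
loss = (w @ torch.randn(10)).pow(2).mean()       # stand-in task loss
loss.backward()
reg_grad = w.detach() - pretrained_w             # gradient of 0.5 * ||w - w_pre||^2
w.grad = project_gradient(w.grad, reg_grad)
with torch.no_grad():
    w -= 1e-2 * w.grad                           # one projected SGD step
```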
arXiv Detail & Related papers (2025-02-21T19:31:55Z) - Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification [7.005068872406135]
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks.
We present a novel approach for exploiting the multilayered nature of pretrained models for ASV.
We show how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
arXiv Detail & Related papers (2024-09-12T05:55:32Z) - Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z) - Deep Fusion: Capturing Dependencies in Contrastive Learning via Transformer Projection Heads [0.0]
Contrastive Learning (CL) has emerged as a powerful method for training feature extraction models using unlabeled data.
Recent studies suggest that incorporating a linear projection head post-backbone significantly enhances model performance.
We introduce a novel application of transformers in the projection head role for contrastive learning, marking the first endeavor of its kind.
arXiv Detail & Related papers (2024-03-27T15:24:54Z) - Enhancing Out-of-Distribution Detection with Multitesting-based Layer-wise Feature Fusion [11.689517005768046]
Out-of-distribution samples may exhibit shifts in local or global features compared to the training distribution.
We propose a novel framework, Multitesting-based Layer-wise Out-of-Distribution (OOD) Detection.
Our scheme effectively enhances the performance of out-of-distribution detection when compared to baseline methods.
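One common way to fuse layer-wise evidence through multiple testing is to convert per-layer OOD scores into p-values and combine them, for example with Fisher's method. The sketch below illustrates that generic recipe on synthetic scores; it is not the paper's exact procedure.

```python
# Hedged sketch: turn per-layer OOD scores into empirical p-values against an
# in-distribution calibration set and combine them with Fisher's method.
# Illustrative of layer-wise multiple-testing fusion, not the paper's test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
num_layers, n_cal = 12, 500

# Calibration scores per layer (e.g., Mahalanobis distance on in-distribution data).
cal_scores = rng.normal(0.0, 1.0, size=(num_layers, n_cal))
# Scores of one test sample at every layer (slightly shifted to mimic OOD).
test_scores = rng.normal(0.8, 1.0, size=num_layers)

# Empirical p-value per layer: fraction of calibration scores at least as extreme.
p_values = [(1 + np.sum(cal_scores[l] >= test_scores[l])) / (1 + n_cal)
            for l in range(num_layers)]

# Fisher's method fuses the per-layer tests into one OOD decision.
fisher_stat = -2 * np.sum(np.log(p_values))
combined_p = stats.chi2.sf(fisher_stat, df=2 * num_layers)
print(f"combined p-value: {combined_p:.4f}  ->  flag as OOD if below the chosen alpha")
```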
arXiv Detail & Related papers (2024-03-16T04:35:04Z) - Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers.
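A minimal sketch of layer-wise averaging is shown below: only the parameters under chosen prefixes are interpolated between two models, while the rest are kept from the first model. The ResNet-18 backbone and prefix names are illustrative assumptions.

```python
# Hedged sketch: interpolate only selected layers' parameters between two
# trained models and keep the rest from model A, as in layer-wise
# interpolation studies. Backbone and prefixes are illustrative.
import copy
import torch
import torchvision

model_a = torchvision.models.resnet18(num_classes=10)
model_b = torchvision.models.resnet18(num_classes=10)   # independently trained in practice

def average_layers(model_a, model_b, prefixes, alpha=0.5):
    """Copy of model_a with parameters under `prefixes` interpolated toward model_b."""
    merged = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged_state = {
        k: (1 - alpha) * state_a[k] + alpha * state_b[k]
           if any(k.startswith(p) for p in prefixes) and state_a[k].is_floating_point()
           else state_a[k]
        for k in state_a
    }
    merged.load_state_dict(merged_state)
    return merged

# Average only the third residual stage; everything else stays from model_a.
merged = average_layers(model_a, model_b, prefixes=["layer3."])
```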
arXiv Detail & Related papers (2023-07-13T09:39:10Z) - On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity.
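The trade-off can be made concrete with a toy subsampling routine: with a fixed pretraining budget, choosing more classes necessarily means fewer samples per class. The pool sizes and budget below are illustrative.

```python
# Hedged sketch: with a fixed pretraining budget, trade the number of classes
# (inter-class diversity) against samples per class (intra-class diversity)
# when subsampling a labeled pool. Illustrative construction only.
import random
from collections import defaultdict

def subsample(pool, budget, num_classes, seed=0):
    """pool: list of (example_id, label). Keep `num_classes` classes and
    budget // num_classes samples from each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example_id, label in pool:
        by_class[label].append(example_id)
    chosen_classes = rng.sample(sorted(by_class), num_classes)
    per_class = budget // num_classes
    subset = []
    for c in chosen_classes:
        subset += [(i, c) for i in rng.sample(by_class[c], per_class)]
    return subset

# Toy pool: 1000 classes x 200 samples; fixed budget of 20,000 pretraining examples.
pool = [(f"img_{c}_{i}", c) for c in range(1000) for i in range(200)]
for num_classes in (100, 200, 500, 1000):      # fewer classes => more samples per class
    subset = subsample(pool, budget=20_000, num_classes=num_classes)
    print(num_classes, "classes ->", len(subset) // num_classes, "samples/class")
```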
arXiv Detail & Related papers (2023-05-20T16:23:50Z) - Domain Adaptation with Adversarial Training on Penultimate Activations [82.9977759320565]
Enhancing model prediction confidence on unlabeled target data is an important objective in Unsupervised Domain Adaptation (UDA).
We show that this strategy is more efficient and better correlated with the objective of boosting prediction confidence than adversarial training on input images or intermediate features.
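The sketch below illustrates one generic form of adversarial training on penultimate activations: perturb the features in the direction that raises prediction entropy, then minimize entropy on the perturbed features. The perturbation size and objective are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: perturb penultimate activations in the direction that raises
# prediction entropy, then minimize entropy on the perturbed features so target
# predictions stay confident. Generic illustration, not the paper's exact loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feature_extractor = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU())
classifier = torch.nn.Linear(64, 10)

x = torch.randn(16, 32)                                  # unlabeled target-domain batch
feats = feature_extractor(x).detach()                    # penultimate activations

# Adversarial direction on the activations: increase prediction entropy.
delta = torch.zeros_like(feats, requires_grad=True)
probs = F.softmax(classifier(feats + delta), dim=1)
entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
grad = torch.autograd.grad(entropy, delta)[0]
delta_adv = 0.1 * grad / (grad.norm(dim=1, keepdim=True) + 1e-8)

# Confidence objective: stay low-entropy even on the perturbed activations.
probs_adv = F.softmax(classifier(feats + delta_adv), dim=1)
loss = -(probs_adv * probs_adv.clamp_min(1e-8).log()).sum(dim=1).mean()
loss.backward()                                           # optimizer step would follow
print(f"entropy on perturbed activations: {loss.item():.4f}")
```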
arXiv Detail & Related papers (2022-08-26T19:50:46Z)