Beyond the final layer: Attentive multilayer fusion for vision transformers
- URL: http://arxiv.org/abs/2601.09322v1
- Date: Wed, 14 Jan 2026 09:50:09 GMT
- Title: Beyond the final layer: Attentive multilayer fusion for vision transformers
- Authors: Laure Ciernik, Marco Morik, Lukas Thede, Luca Eyring, Shinichi Nakajima, Zeynep Akata, Lukas Muttenthaler,
- Abstract summary: We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers.<n>We apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer.<n>This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions.
- Score: 45.627646781613386
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate layer information and demonstrate a principled, task aware approach for unlocking their potential in probing-based adaptation.
Related papers
- LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning [33.76961965760301]
We propose a novel method called Layer-by-Layer Hierarchical Attention Network.<n>It enhances the precision of feature point matching in computer vision by addressing the issue of outliers.<n>Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability.
arXiv Detail & Related papers (2025-12-31T04:25:53Z) - Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition [0.5024983453990065]
layer-wise training eliminates the need for cross-entropy loss and backpropagation.<n>Most existing layer-wise training approaches have been evaluated only on relatively small datasets.<n>We propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R'enyi's $alpha$-order entropy functional.
arXiv Detail & Related papers (2025-10-31T17:24:58Z) - Identifying Super Spreaders in Multilayer Networks [0.6990493129893112]
We introduce a novel approach to identifying super-spreaders in such networks by leveraging graph neural networks.<n>To this end, we construct a dataset by simulating information diffusion across hundreds of networks.<n>Our model, TopSpreadersNetwork, comprises a relationship-agnostic encoder and a custom aggregation layer.
arXiv Detail & Related papers (2025-05-27T10:14:14Z) - PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter [54.33433051500349]
We propose Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model.<n>We also propose a geometry-constrained gate prompt generator (G2PG) shared across different layers.
arXiv Detail & Related papers (2025-05-27T09:27:16Z) - Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond [52.486290612938895]
We propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability.<n> Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM.<n>Our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency.
arXiv Detail & Related papers (2025-03-03T06:16:31Z) - Layer by Layer: Uncovering Hidden Representations in Language Models [28.304269706993942]
We show that intermediate layers can encode even richer representations, often improving performance on a range of downstream tasks.<n>Our framework highlights how each layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer's performance.
arXiv Detail & Related papers (2025-02-04T05:03:42Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification [7.005068872406135]
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks.
We present a novel approach for exploiting the multilayered nature of pretrained models for ASV.
We show how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
arXiv Detail & Related papers (2024-09-12T05:55:32Z) - Leveraging sparse and shared feature activations for disentangled
representation learning [112.22699167017471]
We propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation.
We validate our approach on six real world distribution shift benchmarks, and different data modalities.
arXiv Detail & Related papers (2023-04-17T01:33:24Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z) - Cross-layer Feature Pyramid Network for Salient Object Detection [102.20031050972429]
We propose a novel Cross-layer Feature Pyramid Network to improve the progressive fusion in salient object detection.
The distributed features per layer own both semantics and salient details from all other layers simultaneously, and suffer reduced loss of important information.
arXiv Detail & Related papers (2020-02-25T14:06:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.