ResiDual Transformer Alignment with Spectral Decomposition
- URL: http://arxiv.org/abs/2411.00246v1
- Date: Thu, 31 Oct 2024 22:51:45 GMT
- Title: ResiDual Transformer Alignment with Spectral Decomposition
- Authors: Lorenzo Basile, Valentino Maiorca, Luca Bortolussi, Emanuele RodolĂ , Francesco Locatello,
- Abstract summary: We analyze the phenomenon in vision transformers, focusing on the spectral geometry of residuals.
We show that they encode specialized roles across a wide variety of input data distributions.
We introduce ResiDual, a technique for spectral alignment of the residual stream.
- Score: 31.14332778586179
- License:
- Abstract: When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning level performances on different data distributions while modeling an extremely interpretable and parameter-efficient transformation, as we extensively show on more than 50 (pre-trained network, dataset) pairs.
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Approaching Deep Learning through the Spectral Dynamics of Weights [41.948042468042374]
spectral dynamics of weights -- the behavior of singular values and vectors during optimization -- to clarify and unify several phenomena in deep learning.
We identify a consistent bias in optimization across various experiments, from small-scale grokking'' to large-scale tasks like image classification with ConvNets, image generation with UNets, speech recognition with LSTMs, and language modeling with Transformers.
arXiv Detail & Related papers (2024-08-21T17:48:01Z) - A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z) - p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models [10.713680139939354]
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
PETL has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs)
arXiv Detail & Related papers (2023-12-17T05:30:35Z) - Hodge-Aware Contrastive Learning [101.56637264703058]
Simplicial complexes prove effective in modeling data with multiway dependencies.
We develop a contrastive self-supervised learning approach for processing simplicial data.
arXiv Detail & Related papers (2023-09-14T00:40:07Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer ( DAT++), a vision backbone efficient and effective for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Learning Diverse Features in Vision Transformers for Improved
Generalization [15.905065768434403]
We show that vision transformers (ViTs) tend to extract robust and spurious features with distinct attention heads.
As a result of this modularity, their performance under distribution shifts can be significantly improved at test time.
We propose a method to further enhance the diversity and complement of the learned features by encouragingarity of the attention heads' input gradients.
arXiv Detail & Related papers (2023-08-30T19:04:34Z) - Multi-scale Alternated Attention Transformer for Generalized Stereo
Matching [7.493797166406228]
We present a simple but highly effective network called Alternated Attention U-shaped Transformer (AAUformer) to balance the impact of epipolar line in dual and single view.
Compared to other models, our model has several main designs.
We performed a series of both comparative studies and ablation studies on several mainstream stereo matching datasets.
arXiv Detail & Related papers (2023-08-06T08:22:39Z) - ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z) - Multi-Facet Clustering Variational Autoencoders [9.150555507030083]
High-dimensional data, such as images, typically feature multiple interesting characteristics one could cluster over.
We introduce Multi-Facet Clustering Variational Autoencoders (MFCVAE)
MFCVAE learns multiple clusterings simultaneously, and is trained fully unsupervised and end-to-end.
arXiv Detail & Related papers (2021-06-09T17:36:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.