Related papers: Intriguing Equivalence Structures of the Embedding Space of Vision Transformers

Intriguing Equivalence Structures of the Embedding Space of Vision Transformers

URL: http://arxiv.org/abs/2401.15568v1
Date: Sun, 28 Jan 2024 04:59:51 GMT
Title: Intriguing Equivalence Structures of the Embedding Space of Vision Transformers
Authors: Shaeke Salman and Md Montasir Bin Shams and Xiuwen Liu
Abstract summary: Pre-trained large foundation models play a central role in the recent surge of artificial intelligence. Due to their inherent complexity, these models are not well understood. We show via analyses and systematic experiments that the representation space consists of large piecewise linear subspaces.
Score: 1.7418480517632609
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-trained large foundation models play a central role in the recent surge of artificial intelligence, resulting in fine-tuned models with remarkable abilities when measured on benchmark datasets, standard exams, and applications. Due to their inherent complexity, these models are not well understood. While small adversarial inputs to such models are well known, the structures of the representation space are not well characterized despite their fundamental importance. In this paper, using the vision transformers as an example due to the continuous nature of their input space, we show via analyses and systematic experiments that the representation space consists of large piecewise linear subspaces where there exist very different inputs sharing the same representations, and at the same time, local normal spaces where there are visually indistinguishable inputs having very different representations. The empirical results are further verified using the local directional estimations of the Lipschitz constants of the underlying models. Consequently, the resulting representations change the results of downstream models, and such models are subject to overgeneralization and with limited semantically meaningful generalization capability.

Related papers

Generalization of Diffusion Models Arises with a Balanced Representation Space [32.68561555837436]
We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning.<n>We show that memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized "spiky" representations.<n>We propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering.
arXiv Detail & Related papers (2025-12-24T05:40:40Z)
Universally Converging Representations of Matter Across Scientific Foundation Models [5.309886698585678]
We show that representations learned by nearly sixty scientific models are highly aligned across a wide range of chemical systems.<n>On inputs similar to those seen during training, high-performing models align closely and weak models diverge into local sub-optima in representation space.<n>Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models.
arXiv Detail & Related papers (2025-12-03T12:47:06Z)
Connecting Neural Models Latent Geometries with Relative Geodesic Representations [21.71782603770616]
We show that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions.<n>We assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric.<n>We validate our method on model stitching and retrieval tasks, covering autoencoders and vision foundation discriminative models.
arXiv Detail & Related papers (2025-06-02T12:34:55Z)
Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models [6.390475802910619]
We show that representations learned across models trained on the same data can be expressed as linear combinations of a emphuniversal set of basis features.<n>These basis features underlie the learning task itself and remain consistent across models, regardless of scale.
arXiv Detail & Related papers (2025-05-31T17:45:18Z)
Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Least Squares regression.<n>We derive upper bounds on the empirical error which, in the regime, decay at a provable rate of $mathcalO(t-1/2d)$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z)
Compositional Generalization Requires More Than Disentangled Representations [5.762286612061953]
compositional generalization remains a key challenge for deep learning. Many generative models fail to recognize and compose factors to generate out-of-distribution (OOD) samples. We show that models forced-through architectural modifications with regularization or curated training data-can be highly data-efficient and effective at learning to compose in OOD regions.
arXiv Detail & Related papers (2025-01-30T23:20:41Z)
Analyzing Generative Models by Manifold Entropic Metrics [8.477943884416023]
We introduce a novel set of tractable information-theoretic evaluation metrics. We compare various normalizing flow architectures and $beta$-VAEs on the EMNIST dataset. The most interesting finding of our experiments is a ranking of model architectures and training procedures in terms of their inductive bias to converge to aligned and disentangled representations during training.
arXiv Detail & Related papers (2024-10-25T09:35:00Z)
The Extrapolation Power of Implicit Models [2.3526338188342653]
Implicit models are put to the test across various extrapolation scenarios: out-of-distribution, geographical, and temporal shifts. Our experiments consistently demonstrate significant performance advantage with implicit models.
arXiv Detail & Related papers (2024-07-19T16:01:37Z)
Measuring Orthogonality in Representations of Generative Models [81.13466637365553]
In unsupervised representation learning, models aim to distill essential features from high-dimensional data into lower-dimensional learned representations. Disentanglement of independent generative processes has long been credited with producing high-quality representations. We propose two novel metrics: Importance-Weighted Orthogonality (IWO) and Importance-Weighted Rank (IWR)
arXiv Detail & Related papers (2024-07-04T08:21:54Z)
Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models. Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models [7.360937524701675]
Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely.
arXiv Detail & Related papers (2024-02-13T14:07:49Z)
Emergence of Segmentation with Minimalistic White-Box Transformers [22.688777622988795]
Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks. In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms. Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable.
arXiv Detail & Related papers (2023-08-30T19:02:17Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables. The nonlinearity of the generator implies that the latent space shows an unsatisfactory projection of the data space, which results in poor representation learning. We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
Learning from few examples with nonlinear feature maps [68.8204255655161]
We explore the phenomenon and reveal key relationships between dimensionality of AI model's feature space, non-degeneracy of data distributions, and the model's generalisation capabilities. The main thrust of our present analysis is on the influence of nonlinear feature transformations mapping original data into higher- and possibly infinite-dimensional spaces on the resulting model's generalisation capabilities.
arXiv Detail & Related papers (2022-03-31T10:36:50Z)
Low-Rank Constraints for Fast Inference in Structured Models [110.38427965904266]
This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models. Experiments with neural parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces.
arXiv Detail & Related papers (2022-01-08T00:47:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.