The geometry of hidden representations of large transformer models
- URL: http://arxiv.org/abs/2302.00294v2
- Date: Mon, 30 Oct 2023 16:11:05 GMT
- Title: The geometry of hidden representations of large transformer models
- Authors: Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio,
Alessio Ansuini, Alberto Cazzaniga
- Abstract summary: Large transformers are powerful architectures used for self-supervised data analysis across various data types.
The semantic structure of the dataset emerges from a sequence of transformations between one representation and the next.
We show that the semantic information of the dataset is better expressed at the end of the first peak of the intrinsic dimension profile, a phenomenon observed across many models trained on diverse datasets.
- Score: 43.16765170255552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large transformers are powerful architectures used for self-supervised data
analysis across various data types, including protein sequences, images, and
text. In these models, the semantic structure of the dataset emerges from a
sequence of transformations between one representation and the next. We
characterize the geometric and statistical properties of these representations
and how they change as we move through the layers. By analyzing the intrinsic
dimension (ID) and neighbor composition, we find that the representations
evolve similarly in transformers trained on protein language tasks and image
reconstruction tasks. In the first layers, the data manifold expands, becoming
high-dimensional, and then contracts significantly in the intermediate layers.
In the last part of the model, the ID remains approximately constant or forms a
second shallow peak. We show that the semantic information of the dataset is
better expressed at the end of the first peak, and this phenomenon can be
observed across many models trained on diverse datasets. Based on our findings,
we point out an explicit strategy to identify, without supervision, the layers
that maximize semantic content: representations at intermediate layers
corresponding to a relative minimum of the ID profile are more suitable for
downstream learning tasks.
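A minimal sketch, in Python, of the unsupervised layer-selection strategy described above. It assumes per-layer representations have already been extracted (for instance, mean-pooled hidden states from every block of a transformer) and uses the TwoNN estimator as a stand-in for the paper's intrinsic-dimension estimation protocol; the pooling scheme, estimator, and peak-detection heuristic are illustrative assumptions rather than the authors' exact procedure.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """TwoNN intrinsic-dimension (ID) estimate from the ratio of the
    distances to the second and first nearest neighbors of each point."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]         # r2 / r1; column 0 is the point itself
    mu = mu[np.isfinite(mu) & (mu > 1.0)]  # drop duplicate points and degenerate ratios
    return len(mu) / np.sum(np.log(mu))    # maximum-likelihood estimate of the ID

def id_profile(layer_reprs):
    # layer_reprs: list of (n_samples, n_features) arrays, one per layer
    return np.array([twonn_id(h) for h in layer_reprs])

def select_layer(profile):
    # Return the layer at the relative minimum of the ID profile that follows
    # the first peak, i.e. after the initial expansion of the data manifold.
    first_peak = int(np.argmax(profile))   # assumes the global maximum is the first peak
    for i in range(first_peak + 1, len(profile) - 1):
        if profile[i] <= profile[i - 1] and profile[i] <= profile[i + 1]:
            return i                       # first relative minimum after the peak
    return first_peak + int(np.argmin(profile[first_peak:]))  # fallback: lowest ID after the peak

Representations at the selected intermediate layer would then be passed to a downstream probe or classifier.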
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract priors from well-trained transformers on massive images.
Experiments on the PointDA-10 and Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art unsupervised domain adaptation (UDA) performance for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- Statistical signatures of abstraction in deep neural networks [0.0]
We study how abstract representations emerge in a Deep Belief Network (DBN) trained on benchmark datasets.
We show that the representation approaches a universal model determined by the principle of maximal relevance.
We also show that plasticity increases with depth, much as it does in the brain.
arXiv Detail & Related papers (2024-07-01T14:13:11Z)
- On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier [20.17288970927518]
We study the similarity of representations between the hidden layers of individual transformers.
We propose an aligned training approach to enhance the similarity between internal representations.
arXiv Detail & Related papers (2024-06-20T16:41:09Z)
- On Characterizing the Evolution of Embedding Space of Neural Networks using Algebraic Topology [9.537910170141467]
We study, through Betti numbers, how the topology of the feature embedding space changes as data passes through the layers of a well-trained deep neural network (DNN).
We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value.
arXiv Detail & Related papers (2023-11-08T10:45:12Z)
- Learning Structured Output Representations from Attributes using Deep Conditional Generative Models [0.0]
This paper recreates the Conditional Variational Auto-encoder architecture and trains it on images conditioned on attributes.
We attempt to generate new faces with distinct attributes such as hair color and glasses, as well as different bird species samples.
arXiv Detail & Related papers (2023-04-30T17:25:31Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation [133.02190910009384]
We propose a two-stream graph convolutional network (i.e., TSGCN) to handle inter-view confusion between different raw attributes.
Our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation.
arXiv Detail & Related papers (2022-04-19T10:41:09Z)
- Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven and practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
- Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of these tasks, on four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)