Uncovering hidden geometry in Transformers via disentangling position
and context
- URL: http://arxiv.org/abs/2310.04861v2
- Date: Sun, 4 Feb 2024 01:49:07 GMT
- Title: Uncovering hidden geometry in Transformers via disentangling position
and context
- Authors: Jiajun Song and Yiqiao Zhong
- Abstract summary: We present a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components.
For popular transformer architectures and diverse text datasets, empirically we find pervasive mathematical structure.
- Score: 0.6118897979046375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are widely used to extract semantic meanings from input tokens,
yet they usually operate as black-box models. In this paper, we present a
simple yet informative decomposition of hidden states (or embeddings) of
trained transformers into interpretable components. For any layer, embedding
vectors of input sequence samples are represented by a tensor $\boldsymbol{h}
\in \mathbb{R}^{C \times T \times d}$. Given embedding vector
$\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a
sequence (or context) $c \le C$, extracting the mean effects yields the
decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t +
\mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global
mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across
contexts and across positions respectively, and $\mathbf{resid}_{c,t}$ is the
residual vector. For popular transformer architectures and diverse text
datasets, empirically we find pervasive mathematical structure: (1)
$(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral
shape across layers, (2) $(\mathbf{ctx}_c)_c$ shows clear cluster structure
that falls into context topics, and (3) $(\mathbf{pos}_t)_{t}$ and
$(\mathbf{ctx}_c)_c$ are mutually nearly orthogonal. We argue that smoothness
is pervasive and beneficial to transformers trained on languages, and our
decomposition leads to improved model interpretability.
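
The decomposition amounts to extracting the grand mean plus the row and column means of the $C \times T$ array of embedding vectors. As a minimal illustrative sketch (not code from the paper), assuming an embedding tensor `h` of shape `(C, T, d)` taken from some layer, the NumPy snippet below computes $\boldsymbol{\mu}$, $\mathbf{pos}_t$, $\mathbf{ctx}_c$, and $\mathbf{resid}_{c,t}$ and measures the near-orthogonality of finding (3) via the average absolute cosine similarity; the random `h` is only a stand-in for real hidden states.

```python
import numpy as np

def decompose(h):
    """Mean decomposition h[c, t] = mu + pos[t] + ctx[c] + resid[c, t]
    for an embedding tensor h of shape (C, T, d)."""
    mu = h.mean(axis=(0, 1))               # global mean vector, shape (d,)
    pos = h.mean(axis=0) - mu              # positional means across contexts, shape (T, d)
    ctx = h.mean(axis=1) - mu              # contextual means across positions, shape (C, d)
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]  # residual, shape (C, T, d)
    return mu, pos, ctx, resid

# Toy stand-in for one layer's hidden states: C contexts, T positions, width d.
C, T, d = 8, 16, 32
h = np.random.randn(C, T, d)

mu, pos, ctx, resid = decompose(h)

# The four components reconstruct h exactly by construction.
assert np.allclose(h, mu + pos[None, :, :] + ctx[:, None, :] + resid)

# Finding (3): average |cosine| between pos_t and ctx_c vectors.
cos = (pos @ ctx.T) / np.outer(np.linalg.norm(pos, axis=1),
                               np.linalg.norm(ctx, axis=1))
print("mean |cos(pos_t, ctx_c)| =", np.abs(cos).mean())
```

The random `h` above only exercises the code path; the geometric observations (low-dimensional spiral of $(\mathbf{pos}_t)_t$, cluster structure of $(\mathbf{ctx}_c)_c$, and mutual near-orthogonality) are reported by the paper for real trained-transformer hidden states.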
Related papers
- Transformer In-Context Learning for Categorical Data [51.23121284812406]
We extend research on understanding Transformers through the lens of in-context learning with functional data by considering categorical outcomes, nonlinear underlying models, and nonlinear attention.
We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
arXiv Detail & Related papers (2024-05-27T15:03:21Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- Families of costs with zero and nonnegative MTW tensor in optimal transport [0.0]
We compute explicitly the MTW tensor for the optimal transport problem on $\mathbb{R}^n$ with a cost function of the form $\mathsf{c}$.
We analyze the $\sinh$-type hyperbolic cost, providing examples of $\mathsf{c}$-type functions and divergences.
arXiv Detail & Related papers (2024-01-01T20:33:27Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\mathrm{opt} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z)
- Optimal Spectral Recovery of a Planted Vector in a Subspace [80.02218763267992]
We study efficient estimation and detection of a planted vector $v$ whose $\ell_4$ norm differs from that of a Gaussian vector with the same $\ell_2$ norm.
We show that in the regime $n \rho \gg \sqrt{N}$, any spectral method from a large class (and more generally, any low-degree polynomial of the input) fails to detect the planted vector.
arXiv Detail & Related papers (2021-05-31T16:10:49Z)
- Learners' languages [0.0]
The authors show that the fundamental elements of deep learning, namely gradient descent and backpropagation, can be conceptualized as a strong monoidal functor.
We show that a map $A \to B$ in $\mathbf{Para}(\mathbf{SLens})$ has a natural interpretation in terms of dynamical systems.
arXiv Detail & Related papers (2021-03-01T18:34:00Z)
- Learning a Lie Algebra from Unlabeled Data Pairs [7.329382191592538]
Deep convolutional networks (convnets) show a remarkable ability to learn disentangled representations.
This article proposes a machine learning method to discover a nonlinear transformation of the space $\mathbb{R}^n$.
The key idea is to approximate every target $\boldsymbol{y}_i$ by a matrix-vector product of the form $\widetilde{\boldsymbol{y}}_i = \boldsymbol{\phi}(t_i)\, \boldsymbol{x}_i$.
arXiv Detail & Related papers (2020-09-19T23:23:52Z)
- A Canonical Transform for Strengthening the Local $L^p$-Type Universal Approximation Property [4.18804572788063]
$L^p$-type universal approximation theorems guarantee that a given machine learning model class $\mathscr{F} \subseteq C(\mathbb{R}^d, \mathbb{R}^D)$ is dense in $L^p_\mu(\mathbb{R}^d, \mathbb{R}^D)$.
This paper proposes a generic solution to this approximation-theoretic problem by introducing a canonical transformation that "upgrades $\mathscr{F}$'s approximation property".
arXiv Detail & Related papers (2020-06-24T17:46:35Z)
- The Average-Case Time Complexity of Certifying the Restricted Isometry Property [66.65353643599899]
In compressed sensing, the restricted isometry property (RIP) on $M \times N$ sensing matrices guarantees efficient reconstruction of sparse vectors.
We investigate the exact average-case time complexity of certifying the RIP for $M \times N$ matrices with i.i.d. $\mathcal{N}(0, 1/M)$ entries.
arXiv Detail & Related papers (2020-05-22T16:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.