Addressing Token Uniformity in Transformers via Singular Value
Transformation
- URL: http://arxiv.org/abs/2208.11790v2
- Date: Tue, 19 Dec 2023 03:32:26 GMT
- Title: Addressing Token Uniformity in Transformers via Singular Value
Transformation
- Authors: Hanqi Yan, Lin Gui, Wenjie Li, Yulan He
- Abstract summary: Token uniformity is commonly observed in transformer-based models.
We show that a less skewed singular value distribution can alleviate the `token uniformity' problem.
- Score: 24.039280291845706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Token uniformity is commonly observed in transformer-based models, in which
different tokens share a large proportion of similar information after going
through multiple stacked self-attention layers in a transformer. In this paper,
we propose to use the distribution of singular values of outputs of each
transformer layer to characterise the phenomenon of token uniformity and
empirically illustrate that a less skewed singular value distribution can
alleviate the `token uniformity' problem. Based on our observations, we define
several desirable properties of singular value distributions and propose a
novel transformation function for updating the singular values. We show that
apart from alleviating token uniformity, the transformation function should
preserve the local neighbourhood structure in the original embedding space. Our
proposed singular value transformation function is applied to a range of
transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT,
and improved performance is observed in semantic textual similarity evaluation
and a range of GLUE tasks. Our source code is available at
https://github.com/hanqi-qi/tokenUni.git.
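The core of the approach lends itself to a short sketch: compute the SVD of a layer's token-embedding matrix, quantify how skewed the singular value spectrum is, then rescale the singular values with a monotone transform while keeping the singular vectors fixed, which is what preserves the local neighbourhood structure. The `soft_flatten` function and its `alpha` parameter below are illustrative assumptions, not the transformation function proposed in the paper; see the linked repository for the authors' implementation.
```python
# Illustrative sketch (not the paper's exact method): characterise token
# uniformity via the singular value spectrum of token embeddings, then
# flatten the spectrum with a hypothetical monotone transform and
# reconstruct the embeddings from the unchanged singular vectors.
import torch

def spectrum_skew(X: torch.Tensor) -> torch.Tensor:
    """Skewness of the singular value distribution of X (n_tokens, dim)."""
    s = torch.linalg.svdvals(X)
    s = s / s.sum()                       # normalise to a distribution
    mu, sigma = s.mean(), s.std()
    return (((s - mu) / sigma) ** 3).mean()

def soft_flatten(s: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical transform: alpha in (0, 1] compresses large singular
    values relative to small ones, giving a flatter, less skewed spectrum."""
    return s ** alpha

def transform_embeddings(X: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Rescale the singular values while keeping the singular vectors,
    which preserves the local neighbourhood structure of X."""
    U, s, Vh = torch.linalg.svd(X, full_matrices=False)
    s_new = soft_flatten(s, alpha)
    s_new = s_new * (s.sum() / s_new.sum())   # keep the overall scale comparable
    return U @ torch.diag(s_new) @ Vh

if __name__ == "__main__":
    # Deliberately skewed spectrum, standing in for a deep layer's output.
    X = torch.randn(128, 768) @ torch.diag(torch.logspace(0, -3, 768))
    print("skew before:", spectrum_skew(X).item())
    print("skew after :", spectrum_skew(transform_embeddings(X)).item())
```
The demo at the bottom uses a deliberately skewed random matrix only to show the skewness dropping after the transform; in the paper the transformation is applied to the outputs of pretrained transformer layers.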
Related papers
- Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components.
Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z)
- Transformers are Universal In-context Learners [21.513210412394965]
We show that deep transformers can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains.
A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens.
arXiv Detail & Related papers (2024-08-02T16:21:48Z)
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and, in principle, possesses superior theoretical properties.
arXiv Detail & Related papers (2024-03-26T14:18:43Z)
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smoothed output tokens from self-attention and the input tokens, preserving token fidelity (a minimal sketch of this idea appears after this list).
We empirically demonstrate the advantages of the proposed method, NeuTRENO, over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
- Manifold-Preserving Transformers are Effective for Short-Long Range Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation [85.5379146125199]
Variational Auto-Encoder (VAE) has been widely adopted in text generation.
We propose TRACE, a Transformer-based recurrent VAE structure.
arXiv Detail & Related papers (2022-10-22T10:25:35Z)
- Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation [58.4650849317274]
Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
arXiv Detail & Related papers (2022-07-22T04:10:30Z)
- Consistency Regularization for Variational Auto-Encoders [14.423556966548544]
Variational auto-encoders (VAEs) are a powerful approach to unsupervised learning.
We propose a regularization method to enforce consistency in VAEs.
arXiv Detail & Related papers (2021-05-31T10:26:32Z)
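For the over-smoothing regularizer summarised in the "Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals" entry above, the following is a minimal sketch of the general idea, assuming a standard PyTorch multi-head self-attention block; the penalty weight `lam`, the squared-norm form of the penalty, and the placeholder task loss are illustrative assumptions rather than the authors' exact formulation.
```python
# Minimal sketch of a fidelity-style regularizer in the spirit of the
# over-smoothing entry above: penalise the distance between the
# self-attention output and its input tokens so representations do not collapse.
# `lam`, the penalty form, and the placeholder task loss are assumptions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def attention_with_fidelity_penalty(x: torch.Tensor, lam: float = 0.1):
    """x: (batch, seq_len, embed_dim). Returns the attention output and a
    regularization term lam * ||Attn(x) - x||^2 to add to the task loss."""
    out, _ = attn(x, x, x)                       # standard self-attention
    penalty = lam * (out - x).pow(2).mean()      # fidelity term
    return out, penalty

x = torch.randn(8, 16, 64)
out, reg = attention_with_fidelity_penalty(x)
task_loss = out.pow(2).mean()                    # placeholder task loss
loss = task_loss + reg                           # regularized objective
loss.backward()
```
In practice such a penalty would be computed per attention layer and summed into the training objective alongside the task loss.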