Rethinking Positional Encoding in Language Pre-training
- URL: http://arxiv.org/abs/2006.15595v4
- Date: Mon, 15 Mar 2021 07:56:22 GMT
- Title: Rethinking Positional Encoding in Language Pre-training
- Authors: Guolin Ke, Di He, Tie-Yan Liu
- Abstract summary: We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
- Score: 111.2320727291926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the positional encoding methods used in language
pre-training (e.g., BERT) and identify several problems in the existing
formulations. First, we show that in the absolute positional encoding, the
addition operation applied on positional embeddings and word embeddings brings
mixed correlations between the two heterogeneous information resources. It may
bring unnecessary randomness in the attention and further limit the
expressiveness of the model. Second, we question whether treating the position
of the symbol \texttt{[CLS]} the same as other words is a reasonable design,
considering its special role (the representation of the entire sentence) in the
downstream tasks. Motivated by the above analysis, we propose a new positional
encoding method called \textbf{T}ransformer with \textbf{U}ntied
\textbf{P}ositional \textbf{E}ncoding (TUPE). In the self-attention module,
TUPE computes the word contextual correlation and positional correlation
separately with different parameterizations and then adds them together. This
design removes the mixed and noisy correlations over heterogeneous embeddings
and offers more expressiveness by using different projection matrices.
Furthermore, TUPE unties the \texttt{[CLS]} symbol from other positions, making
it easier to capture information from all positions. Extensive experiments and
ablation studies on the GLUE benchmark demonstrate the effectiveness of the
proposed method. Code and models are released at
https://github.com/guolinke/TUPE.
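To make the separation described in the abstract concrete, the following is a minimal, single-head PyTorch sketch of untied attention scores: word-to-word and position-to-position correlations are computed with separate projection matrices and summed, and the [CLS] position is untied by replacing its positional correlations with learnable scalars. The class and parameter names (UntiedPositionalAttention, theta_cls_to_others, theta_others_to_cls), the single-head simplification, and the omitted relative-position terms are assumptions for illustration, not the authors' released implementation; see https://github.com/guolinke/TUPE for the official code.

```python
# Minimal single-head sketch of TUPE-style untied attention (illustrative only).
import math
import torch
import torch.nn as nn

class UntiedPositionalAttention(nn.Module):
    """Computes attention weights from two separately parameterized terms:
    a word-to-word correlation and a position-to-position correlation.
    Position index 0 is treated as [CLS] and untied via two learned scalars."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        # Projections for word (content) embeddings.
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        # Absolute position embeddings with their own projections.
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.uq = nn.Linear(d_model, d_model, bias=False)
        self.uk = nn.Linear(d_model, d_model, bias=False)
        # Learned scalars that replace [CLS]-related positional correlations.
        self.theta_cls_to_others = nn.Parameter(torch.zeros(1))
        self.theta_others_to_cls = nn.Parameter(torch.zeros(1))
        self.scale = 1.0 / math.sqrt(2 * d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); position 0 is assumed to be [CLS].
        bsz, seq_len, _ = x.shape
        # Word contextual correlation, computed only from word embeddings.
        content_scores = self.wq(x) @ self.wk(x).transpose(-1, -2)
        # Positional correlation, computed only from position embeddings.
        pos = self.pos_emb(torch.arange(seq_len, device=x.device))
        pos_scores = self.uq(pos) @ self.uk(pos).transpose(-1, -2)
        pos_scores = pos_scores.unsqueeze(0).expand(bsz, -1, -1).clone()
        # Untie [CLS]: overwrite its row and column with the learned scalars.
        pos_scores[:, 0, :] = self.theta_cls_to_others
        pos_scores[:, :, 0] = self.theta_others_to_cls
        # Sum the two terms and normalize; no positional embedding was ever
        # added to the word embedding, so the correlations stay unmixed.
        return torch.softmax((content_scores + pos_scores) * self.scale, dim=-1)

# Example: attention weights for a batch of 2 sequences of length 16.
attn = UntiedPositionalAttention(d_model=64)(torch.randn(2, 16, 64))
```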
Related papers
- SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings.
Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z) - Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding [89.52931576290976]
Transformers rely on both content-based and position-based addressing mechanisms to make predictions.
TAPE is a novel framework that enhances positional embeddings by incorporating sequence content across layers.
Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead.
arXiv Detail & Related papers (2025-01-01T03:23:00Z) - PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer [51.260384040953326]
Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios.
We propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition.
PosFormer consistently outperforms state-of-the-art methods, with gains of 2.03%/1.22%/2, 1.83%, and 4.62% on the evaluated datasets.
arXiv Detail & Related papers (2024-07-10T15:42:58Z) - Contextual Position Encoding: Learning to Count What's Important [42.038277620194]
We propose a new position encoding method, Contextual Position Encoding (CoPE).
CoPE allows positions to be conditioned on context by incrementing position on certain tokens determined by the model (a minimal sketch of this idea appears after the list below).
We show that CoPE can solve the selective copy, counting and Flip-Flop tasks where popular position embeddings fail.
arXiv Detail & Related papers (2024-05-29T02:57:15Z) - Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z) - HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding [0.0]
In Transformer-based architectures, the attention mechanism is inherently permutation-invariant with respect to the input sequence's tokens.
We introduce Hyperbolic Positional Attention (HyPE), a novel method that utilizes hyperbolic functions' properties to encode tokens' relative positions.
arXiv Detail & Related papers (2023-10-30T15:54:32Z) - The Locality and Symmetry of Positional Encodings [9.246374019271938]
We conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style).
We uncover the core function of PEs by identifying two common properties, Locality and Symmetry.
We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly.
arXiv Detail & Related papers (2023-10-19T16:15:15Z) - Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z) - Improve Transformer Pre-Training with Decoupled Directional Relative Position Encoding and Representation Differentiations [23.2969212998404]
We revisit the Transformer-based pre-trained language models and identify two problems that may limit the expressiveness of the model.
Existing relative position encoding models confuse two heterogeneous types of information: relative distance and direction.
We propose two novel techniques to improve pre-trained language models.
arXiv Detail & Related papers (2022-10-09T12:35:04Z) - Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed distance measure, NDD, to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Automated Feature-Topic Pairing: Aligning Semantic and Embedding Spaces in Spatial Representation Learning [28.211312371895]
This paper formulates a new problem: feature-topic pairing, and proposes a novel Particle Swarm Optimization (PSO) based deep learning framework.
Specifically, we formulate the problem into an automated alignment task between 1) a latent embedding feature space and 2) a semantic topic space.
We design a PSO based solver to simultaneously select an optimal set of topics and learn corresponding features based on the selected topics.
arXiv Detail & Related papers (2021-09-22T21:55:36Z) - Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z) - Logic Constrained Pointer Networks for Interpretable Textual Similarity [11.142649867439406]
We introduce a novel pointer network based model with a sentinel gating function to align constituent chunks.
We improve this base model with a loss function to equally penalize misalignments in both sentences, ensuring the alignments are bidirectional.
The model achieves an F1 score of 97.73 and 96.32 on the benchmark SemEval datasets for the chunk alignment task.
arXiv Detail & Related papers (2020-07-15T13:01:44Z) - Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
arXiv Detail & Related papers (2020-06-05T19:53:30Z)
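As referenced in the Contextual Position Encoding (CoPE) entry above, positions can be made context-dependent by letting each query gate which earlier tokens it counts. The sketch below is a loose, single-head PyTorch illustration of that idea (gated counting followed by interpolation of integer position embeddings); the function name cope_position_logits, the shapes, and the clamping are assumptions for illustration, not the authors' reference implementation.

```python
# Loose sketch of context-conditioned positions in the spirit of CoPE.
import torch

def cope_position_logits(q: torch.Tensor, k: torch.Tensor,
                         pos_emb: torch.Tensor) -> torch.Tensor:
    """q, k: (batch, seq, d) per-head queries/keys.
    pos_emb: (max_pos + 1, d) learnable embeddings for integer positions.
    Returns a (batch, seq, seq) additive term for the attention logits."""
    # Each query decides, via a sigmoid gate on its score with every earlier
    # key, whether that key should increment the position count.
    gates = torch.sigmoid(q @ k.transpose(-1, -2)).tril()      # causal gates
    # Contextual position of key j w.r.t. query i = sum of gates over j..i,
    # i.e. a reversed cumulative sum along the key dimension.
    pos = gates.flip(-1).cumsum(-1).flip(-1)
    pos = pos.clamp(max=pos_emb.size(0) - 1)
    # Positions are fractional, so interpolate between neighbouring integer
    # position embeddings; q_i . e_p gives the logit for integer position p.
    lo, hi = pos.floor().long(), pos.ceil().long()
    w = pos - lo.float()                                        # weights in [0, 1)
    logits_int = q @ pos_emb.transpose(0, 1)                    # (b, s, max_pos+1)
    lo_logit = torch.gather(logits_int, -1, lo)
    hi_logit = torch.gather(logits_int, -1, hi)
    return (1 - w) * lo_logit + w * hi_logit

# Example: additive position logits for one head.
b, s, d, max_pos = 2, 8, 16, 8
out = cope_position_logits(torch.randn(b, s, d), torch.randn(b, s, d),
                           torch.randn(max_pos + 1, d))
```

In a full attention layer this term would be added to the content logits before the softmax, with future keys removed by the usual causal mask.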