Rethinking Positional Encoding in Language Pre-training
- URL: http://arxiv.org/abs/2006.15595v4
- Date: Mon, 15 Mar 2021 07:56:22 GMT
- Title: Rethinking Positional Encoding in Language Pre-training
- Authors: Guolin Ke, Di He, Tie-Yan Liu
- Abstract summary: We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
- Score: 111.2320727291926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the positional encoding methods used in language
pre-training (e.g., BERT) and identify several problems in the existing
formulations. First, we show that in the absolute positional encoding, the
addition operation applied on positional embeddings and word embeddings brings
mixed correlations between the two heterogeneous information resources. It may
bring unnecessary randomness in the attention and further limit the
expressiveness of the model. Second, we question whether treating the position
of the symbol \texttt{[CLS]} the same as other words is a reasonable design,
considering its special role (the representation of the entire sentence) in the
downstream tasks. Motivated by the above analysis, we propose a new positional
encoding method called \textbf{T}ransformer with \textbf{U}ntied
\textbf{P}ositional \textbf{E}ncoding (TUPE). In the self-attention module,
TUPE computes the word contextual correlation and positional correlation
separately with different parameterizations and then adds them together. This
design removes the mixed and noisy correlations over heterogeneous embeddings
and offers more expressiveness by using different projection matrices.
Furthermore, TUPE unties the \texttt{[CLS]} symbol from other positions, making
it easier to capture information from all positions. Extensive experiments and
ablation studies on GLUE benchmark demonstrate the effectiveness of the
proposed method. Codes and models are released at
https://github.com/guolinke/TUPE.
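To make the abstract's first point concrete, the pre-softmax attention logit under standard absolute positional encoding expands into four terms once the word embedding w_i and the positional embedding p_i are added at the input. In the derivation below, W^Q and W^K are the shared attention projections and U^Q, U^K are the separate position projections of the untied variant; the 1/sqrt(2d) scaling is one reasonable normalization choice assumed here for illustration, not necessarily the paper's exact constant.

\alpha_{ij}^{\mathrm{abs}}
  = \frac{\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^\top}{\sqrt{d}}
  = \frac{w_i W^Q (w_j W^K)^\top + w_i W^Q (p_j W^K)^\top + p_i W^Q (w_j W^K)^\top + p_i W^Q (p_j W^K)^\top}{\sqrt{d}}

The two middle (word-to-position and position-to-word) terms are the "mixed correlations" the abstract refers to. The untied form computes the word-to-word and position-to-position correlations with different projection matrices and simply adds them, dropping the cross terms:

\alpha_{ij}^{\mathrm{untied}}
  = \frac{(w_i W^Q)(w_j W^K)^\top}{\sqrt{2d}} + \frac{(p_i U^Q)(p_j U^K)^\top}{\sqrt{2d}}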
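The untied design, including the special treatment of [CLS], can also be sketched in code. The single-head PyTorch module below is a minimal illustration written from the abstract, not the authors' released implementation (see the linked repository for that); in particular, replacing the [CLS] row and column of the positional scores with two learned scalar biases is an assumption about how the untying could be realized, and all names are illustrative.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class UntiedSelfAttention(nn.Module):
    """Single-head self-attention with untied word/position correlations (sketch)."""
    def __init__(self, d_model, max_len):
        super().__init__()
        # Separate projections for word (content) and position correlations.
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.uq = nn.Linear(d_model, d_model)
        self.uk = nn.Linear(d_model, d_model)
        # Absolute positional embeddings, used only inside attention.
        self.pos = nn.Embedding(max_len, d_model)
        # Learned scalars that replace position-based scores for [CLS]
        # (one for [CLS] attending to others, one for the reverse direction).
        self.cls_to_others = nn.Parameter(torch.zeros(1))
        self.others_to_cls = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                   # x: (batch, seq, d)
        b, n, d = x.shape
        scale = math.sqrt(2 * d)                             # assumed normalization
        # Word-to-word correlation: positions are NOT added to the input.
        content = self.wq(x) @ self.wk(x).transpose(-2, -1) / scale    # (b, n, n)
        # Position-to-position correlation with its own projections.
        p = self.pos(torch.arange(n, device=x.device))                 # (n, d)
        position = self.uq(p) @ self.uk(p).transpose(-2, -1) / scale   # (n, n)
        # Untie [CLS] (assumed at index 0): its positional scores become
        # position-agnostic learned biases.
        position = position.clone()
        position[0, :] = self.cls_to_others
        position[1:, 0] = self.others_to_cls
        attn = F.softmax(content + position, dim=-1)         # broadcast over batch
        return attn @ self.wv(x)

# Example usage on random inputs (no padding mask, for brevity):
# layer = UntiedSelfAttention(d_model=64, max_len=128)
# out = layer(torch.randn(2, 16, 64))                        # -> (2, 16, 64)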
Related papers
- PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer [51.260384040953326]
Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios.
We propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition.
PosFormer consistently outperforms state-of-the-art methods, with gains of 2.03%/1.22%/2, 1.83%, and 4.62% on the evaluation datasets.
arXiv Detail & Related papers (2024-07-10T15:42:58Z) - Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z) - The Locality and Symmetry of Positional Encodings [9.246374019271938]
We conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style).
We uncover the core function of PEs by identifying two common properties, Locality and Symmetry.
We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly.
arXiv Detail & Related papers (2023-10-19T16:15:15Z) - Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Automated Feature-Topic Pairing: Aligning Semantic and Embedding Spaces
in Spatial Representation Learning [28.211312371895]
This paper formulates a new problem: feature-topic pairing, and proposes a novel Particle Swarm Optimization (PSO) based deep learning framework.
Specifically, we formulate the problem into an automated alignment task between 1) a latent embedding feature space and 2) a semantic topic space.
We design a PSO based solver to simultaneously select an optimal set of topics and learn corresponding features based on the selected topics.
arXiv Detail & Related papers (2021-09-22T21:55:36Z) - Generalized Funnelling: Ensemble Learning and Heterogeneous Document
Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z) - Logic Constrained Pointer Networks for Interpretable Textual Similarity [11.142649867439406]
We introduce a novel pointer network based model with a sentinel gating function to align constituent chunks.
We improve this base model with a loss function to equally penalize misalignments in both sentences, ensuring the alignments are bidirectional.
The model achieves F1 scores of 97.73 and 96.32 on the benchmark SemEval datasets for the chunk alignment task.
arXiv Detail & Related papers (2020-07-15T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.