What Do Position Embeddings Learn? An Empirical Study of Pre-Trained
Language Model Positional Encoding
- URL: http://arxiv.org/abs/2010.04903v1
- Date: Sat, 10 Oct 2020 05:03:14 GMT
- Title: What Do Position Embeddings Learn? An Empirical Study of Pre-Trained
Language Model Positional Encoding
- Authors: Yu-An Wang, Yun-Nung Chen
- Abstract summary: This paper provides new insight into pre-trained position embeddings through feature-level analysis and empirical experiments on most of the iconic NLP tasks.
The experimental results can guide future work in choosing a suitable positional encoding function for specific tasks, given the properties of the application.
- Score: 42.011175069706816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, pre-trained Transformers have dominated the majority of NLP
benchmark tasks. Many variants of pre-trained Transformers keep emerging,
and most focus on designing different pre-training objectives or variants
of self-attention. Embedding position information in the self-attention
mechanism is also an indispensable factor in Transformers; however, it is often
discussed only in passing. Therefore, this paper carries out an empirical study on the
position embeddings of mainstream pre-trained Transformers, focusing
on two questions: 1) Do position embeddings really learn the meaning of
positions? 2) How do these different learned position embeddings affect
Transformers on NLP tasks? This paper provides new insight into
pre-trained position embeddings through feature-level analysis and empirical
experiments on most of the iconic NLP tasks. We believe these experimental
results can guide future work in choosing a suitable positional encoding
function for specific tasks, given the properties of the application.
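As a rough illustration of the kind of feature-level analysis the paper describes, the sketch below loads the learned absolute position embeddings of a pre-trained BERT model and probes whether nearby positions are more similar than distant ones. The model choice and the cosine-similarity probe are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch of feature-level analysis of learned position embeddings,
# assuming the Hugging Face transformers library; the cosine-similarity probe
# is an illustrative choice, not necessarily the paper's exact analysis.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# BERT learns absolute position embeddings: one vector per position (512 x 768).
pos_emb = model.embeddings.position_embeddings.weight.detach()

# Normalize so dot products become cosine similarities between positions.
pos_emb = pos_emb / pos_emb.norm(dim=-1, keepdim=True)
sim = pos_emb @ pos_emb.T  # (512, 512) position-position similarity matrix

# If the embeddings encode position order, nearby positions should be
# more similar than distant ones.
print("sim(10, 11)  =", sim[10, 11].item())
print("sim(10, 200) =", sim[10, 200].item())
```

Sinusoidal encodings exhibit a smooth, periodic similarity structure by construction; whether learned embeddings recover a comparable ordering is exactly the kind of question a feature-level analysis can answer.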
Related papers
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
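A hedged sketch of the disentangled idea in this entry: a lightweight, trainable module synthesizes a task-specific query that is prepended to the frozen backbone's patch tokens. The module shape and names below are assumptions for illustration, not the paper's actual architecture.

```python
# A minimal sketch, assuming a frozen ViT backbone with 768-dim tokens;
# only the small QuerySynthesizer below would be trained.
import torch
import torch.nn as nn

class QuerySynthesizer(nn.Module):
    """Lightweight module, independent of the frozen pre-trained model."""
    def __init__(self, d_model: int, hidden: int = 32):
        super().__init__()
        self.seed = nn.Parameter(torch.zeros(1, 1, hidden))
        self.proj = nn.Linear(hidden, d_model)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Synthesize one task-specific query token per example.
        return self.proj(self.seed).expand(batch_size, -1, -1)

d_model = 768
synth = QuerySynthesizer(d_model)             # the only trainable parameters
patch_tokens = torch.randn(4, 196, d_model)   # frozen ViT patch embeddings
tokens = torch.cat([synth(4), patch_tokens], dim=1)  # query + patches
print(tokens.shape)  # torch.Size([4, 197, 768])
```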
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task [40.85615657802704]
This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers.
We present theoretical and empirical analysis using both causal and non-causal Transformer architectures.
arXiv Detail & Related papers (2024-05-30T20:52:23Z)
- Position-Aware Parameter Efficient Fine-Tuning Approach for Reducing Positional Bias in LLMs [18.832135309689736]
Recent advances in large language models (LLMs) have enhanced their ability to process long input contexts.
Recent studies show a positional bias in LLMs, demonstrating varying performance depending on the location of useful information.
We develop a Position-Aware Parameter Efficient Fine-Tuning (PAPEFT) approach, which is composed of a data augmentation technique and an efficient parameter adapter.
arXiv Detail & Related papers (2024-04-01T19:04:17Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics follow a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
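The mechanism behind this finding can be demonstrated with a toy model: under causal self-attention, position t aggregates over t tokens, so the variance of its output shrinks with depth into the sequence even when no positional signal is provided. The uniform-attention simplification below is an assumption for illustration, not the paper's setup.

```python
# A toy demonstration of variance shrinkage under causal self-attention:
# position t averages over t+1 values, so its output variance decays
# roughly like 1/(t+1) even with no position embeddings.
import torch

torch.manual_seed(0)
T, d = 128, 64
values = torch.randn(T, d)  # i.i.d. token value vectors, no position signal

# Causal uniform attention: output at position t is the mean of values[:t+1].
causal = torch.tril(torch.ones(T, T))
attn = causal / causal.sum(dim=-1, keepdim=True)
out = attn @ values

var_per_pos = out.var(dim=-1)  # variance of each position's output vector
for t in [0, 3, 15, 63, 127]:
    print(f"position {t:3d}: variance {var_per_pos[t].item():.4f}")
# Variance decreases with position, so position is recoverable from it.
```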
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Evaluating Prompt-based Question Answering for Object Prediction in the Open Research Knowledge Graph [0.0]
This work reports results on adopting prompt-based training of transformers for scholarly knowledge graph object prediction.
It deviates from other works that propose entity and relation extraction pipelines for predicting objects of a scholarly knowledge graph.
We find that (i) as expected, transformer models tested out-of-the-box underperform on a new domain of data, and (ii) prompt-based training of the models achieves performance boosts of up to 40% in a relaxed evaluation setting.
arXiv Detail & Related papers (2023-05-22T10:35:18Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
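A minimal sketch of such a position-prediction objective: content embeddings are fed to an encoder with no positional information, and a classification head is trained to recover each element's position. The toy encoder and module sizes are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of position-prediction pretraining: the encoder sees
# content with NO position information and must predict each element's
# position. Sizes and modules are illustrative assumptions.
import torch
import torch.nn as nn

T, d, vocab = 32, 64, 1000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Embedding(vocab, d)  # content embeddings only, no positions added
pos_head = nn.Linear(d, T)      # classify each token's position in [0, T)

tokens = torch.randint(0, vocab, (8, T))      # a batch of 8 sequences
hidden = encoder(embed(tokens))               # (8, T, d)
logits = pos_head(hidden)                     # (8, T, T)
target = torch.arange(T).expand(8, T)         # true position of each slot
loss = nn.functional.cross_entropy(
    logits.reshape(-1, T), target.reshape(-1)
)
loss.backward()  # in practice, optimized over a large corpus
```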
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- Paragraph-based Transformer Pre-training for Multi-Sentence Inference [99.59693674455582]
We show that popular pre-trained transformers perform poorly when used for fine-tuning on multi-candidate inference tasks.
We then propose a new pre-training objective that models the paragraph-level semantics across multiple input sentences.
arXiv Detail & Related papers (2022-05-02T21:41:14Z)
- Training ELECTRA Augmented with Multi-word Selection [53.77046731238381]
We present a new text encoder pre-training method that improves ELECTRA based on multi-task learning.
Specifically, we train the discriminator to simultaneously detect replaced tokens and select original tokens from candidate sets.
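A hedged sketch of the multi-task discriminator objective this entry describes: one head performs ELECTRA-style replaced-token detection, and a second scores a small candidate set to select the original token. The heads and shapes below are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a two-headed ELECTRA-style discriminator loss:
# binary replaced-token detection plus original-token selection from
# K candidates. All modules and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, d, K = 8, 32, 64, 5   # batch, length, hidden size, candidates per slot
hidden = torch.randn(B, T, d)       # discriminator encoder outputs (stand-in)
detect_head = nn.Linear(d, 1)       # head 1: replaced-token detection
select_head = nn.Linear(d, d)       # head 2: projects hidden for scoring

is_replaced = torch.randint(0, 2, (B, T)).float()
detect_loss = F.binary_cross_entropy_with_logits(
    detect_head(hidden).squeeze(-1), is_replaced
)

cand_emb = torch.randn(B, T, K, d)  # embeddings of the K candidate tokens
scores = (select_head(hidden).unsqueeze(2) * cand_emb).sum(-1)  # (B, T, K)
original_idx = torch.randint(0, K, (B, T))  # index of the original token
select_loss = F.cross_entropy(scores.reshape(-1, K), original_idx.reshape(-1))

loss = detect_loss + select_loss    # joint multi-task objective
print(float(loss))
```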
arXiv Detail & Related papers (2021-05-31T23:19:00Z)
- Position Information in Transformers: An Overview [6.284464997330884]
This paper provides an overview of common methods to incorporate position information into Transformer models.
The objective of this survey is to showcase that position information in Transformers is a vibrant and extensive research area.
arXiv Detail & Related papers (2021-02-22T15:03:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.