Improve Transformer Pre-Training with Decoupled Directional Relative
Position Encoding and Representation Differentiations
- URL: http://arxiv.org/abs/2210.04246v1
- Date: Sun, 9 Oct 2022 12:35:04 GMT
- Title: Improve Transformer Pre-Training with Decoupled Directional Relative
Position Encoding and Representation Differentiations
- Authors: Haojie Zhang, Mingfei Liang, Ruobing Xie, Zhenlong Sun, Bo Zhang, Leyu
Lin
- Abstract summary: We revisit the Transformer-based pre-trained language models and identify two problems that may limit the expressiveness of the model.
Existing relative position encoding models conflate two heterogeneous kinds of information: relative distance and direction.
We propose two novel techniques to improve pre-trained language models.
- Score: 23.2969212998404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we revisit the Transformer-based pre-trained language models
and identify two problems that may limit the expressiveness of the model.
Firstly, existing relative position encoding models (e.g., T5 and DEBERTA)
conflate two heterogeneous kinds of information: relative distance and
direction. This may leave the model unable to capture the associative semantics
shared by tokens in the same direction or at the same distance, which in turn
hurts performance on downstream tasks. Secondly, we notice that BERT pre-trained
with the Masked Language Modeling (MLM) objective outputs similar token
representations and similar attention weights across different heads, which may
make it difficult to capture discriminative semantic representations. Motivated
by the above investigation, we propose two novel techniques to improve
pre-trained language models: Decoupled Directional Relative Position (DDRP)
encoding and the MTH pre-training objective. DDRP decouples the relative
distance features from the directional features in classical relative position
encoding for a better understanding of position information. MTH adds two novel
auxiliary losses besides MLM to enlarge the dissimilarities between (a) the last
hidden states of different tokens and (b) the attention weights of different
heads, alleviating the homogenization and anisotropy problems in representation
learning for better optimization. Extensive experiments and ablation studies on
the GLUE benchmark demonstrate the effectiveness of our proposed methods.
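To make the two proposed techniques more concrete, below is a minimal PyTorch-style sketch of (i) a relative-position attention bias that keeps separate embedding tables for unsigned distance and for direction, and (ii) MTH-style auxiliary losses that penalize similarity between token representations and between attention heads. The bucketing scheme, the additive recombination of the two position tables, the cosine-similarity form of the losses, and the weights lambda_tok / lambda_head are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of DDRP-style decoupled position bias and MTH-style
# "differentiation" losses; details below are assumptions, not the paper's spec.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledRelativePositionBias(nn.Module):
    """Additive attention bias built from two separate tables: one for the
    (bucketed) unsigned relative distance and one for the direction."""

    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: int = 128):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.distance_emb = nn.Embedding(num_buckets, num_heads)  # |i - j|, bucketed
        self.direction_emb = nn.Embedding(3, num_heads)           # before / same / after

    def _bucket(self, distance: torch.Tensor) -> torch.Tensor:
        # Log-spaced bucketing of the unsigned distance (assumed; T5 uses a
        # similar scheme but on the signed offset).
        d = distance.clamp(min=1).float()
        scale = (self.num_buckets - 2) / torch.log(torch.tensor(float(self.max_distance)))
        bucket = (torch.log(d) * scale).long() + 1                # buckets 1..num_buckets-1
        bucket = torch.where(distance == 0, torch.zeros_like(bucket), bucket)
        return bucket.clamp(max=self.num_buckets - 1)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                         # signed offset
        dist_bias = self.distance_emb(self._bucket(rel.abs()))    # (L, L, H)
        direction = (torch.sign(rel) + 1).long()                  # {-1,0,1} -> {0,1,2}
        dir_bias = self.direction_emb(direction)                  # (L, L, H)
        # One plausible way to recombine the decoupled features: add them.
        return (dist_bias + dir_bias).permute(2, 0, 1)            # (H, L, L) bias on attention logits


def token_differentiation_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between last hidden states of different
    tokens; minimizing it pushes token representations apart.  hidden: (B, L, D)."""
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.transpose(1, 2)                                   # (B, L, L)
    n = sim.size(-1)
    off_diag = sim - torch.eye(n, device=sim.device)              # drop self-similarity
    return off_diag.sum() / (sim.size(0) * n * (n - 1))


def head_differentiation_loss(attn: torch.Tensor) -> torch.Tensor:
    """Mean pairwise similarity between the attention maps of different heads.
    attn: (B, H, L, L) attention probabilities."""
    b, h, _, _ = attn.shape
    a = F.normalize(attn.reshape(b, h, -1), dim=-1)
    sim = a @ a.transpose(1, 2)                                   # (B, H, H)
    off_diag = sim - torch.eye(h, device=attn.device)
    return off_diag.sum() / (b * h * (h - 1))


# Hypothetical overall objective (weights lambda_tok / lambda_head are assumed):
#   total = mlm_loss + lambda_tok * token_differentiation_loss(last_hidden) \
#                    + lambda_head * head_differentiation_loss(attn_probs)
```

In a full model, the returned (H, L, L) bias would be added to the attention logits in every layer, and the two auxiliary terms would be summed with the MLM loss during pre-training.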
Related papers
- Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders [6.7181844004432385]
The Inter-Intra Modal Measure (IIMM) functions as a strong predictor of performance changes with fine-tuning.
Fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation.
With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning.
arXiv Detail & Related papers (2024-07-22T15:35:09Z)
- Morphing Tokens Draw Strong Masked Image Models [28.356863521946607]
Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs).
We introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets.
DTM is compatible with various SSL frameworks; we showcase improved MIM results obtained by employing DTM while introducing barely any extra training cost.
arXiv Detail & Related papers (2023-12-30T14:53:09Z)
- Location Sensitive Embedding for Knowledge Graph Reasoning [0.0]
A key challenge in translational distance models is their inability to effectively differentiate between 'head' and 'tail' entities in graphs.
To address this problem, a novel location-sensitive embedding (LSE) method has been developed.
LSE innovatively modifies the head entity using relation-specific mappings, conceptualizing relations as linear transformations rather than mere translations (an illustrative sketch follows this related-papers list).
Experiments conducted on four large-scale KG datasets for link prediction show that LSEd either outperforms or is competitive with state-of-the-art related works.
arXiv Detail & Related papers (2023-12-01T22:35:19Z)
- Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens [9.590540796223715]
In this paper, we attempt to explore the in-context learning process in Transformers through a lens of representation learning.
The ICL inference process of the attention layer aligns with the training procedure of its dual model, generating token representation predictions.
We extend our theoretical conclusions to more complicated scenarios, including one Transformer layer and multiple attention layers.
arXiv Detail & Related papers (2023-10-20T01:55:34Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
The token-to-token alignment is integrated to distinguish synonymous tokens, excavated via the thesaurus dictionary, from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- On the Role of Bidirectionality in Language Model Pre-Training [85.14614350372004]
We study the role of bidirectionality in next token prediction, text infilling, zero-shot priming and fine-tuning.
We train models with up to 6.7B parameters, and find differences to remain consistent at scale.
arXiv Detail & Related papers (2022-05-24T02:25:05Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- SLUA: A Super Lightweight Unsupervised Word Alignment Model via Cross-Lingual Contrastive Learning [79.91678610678885]
We propose a super lightweight unsupervised word alignment model (SLUA).
Experimental results on several public benchmarks demonstrate that our model achieves competitive, if not better, performance.
Notably, we regard our model as a pioneering attempt to unify bilingual word embedding and word alignment.
arXiv Detail & Related papers (2021-02-08T05:54:11Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED).
TRED disentangles the knowledge relevant to the target task from the original source model and uses it as a regularizer when fine-tuning the target model.
Experiments on various real-world datasets show that our method stably improves standard fine-tuning by more than 2% on average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z)
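As referenced in the Location Sensitive Embedding entry above, the idea of treating relations as linear transformations of the head entity (rather than pure translations) can be illustrated with a small, hedged sketch; the exact parameterization and loss used by LSE may differ.

```python
# Illustrative contrast between a purely translational score (TransE-style) and a
# location-sensitive one that first applies a relation-specific linear map to the
# head entity.  This is only a sketch of the idea summarized above, not LSE's
# exact scoring function.
import torch


def transe_score(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Relation as a pure translation: head and tail roles are hard to tell apart.
    return -torch.norm(h + r - t, p=2, dim=-1)


def location_sensitive_score(h: torch.Tensor, M_r: torch.Tensor,
                             r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Relation as a linear transformation of the head entity plus a translation,
    # so the head is treated asymmetrically from the tail.
    return -torch.norm(M_r @ h + r - t, p=2, dim=-1)


# Toy usage with random embeddings (dimension chosen arbitrarily).
d = 8
h, r, t = torch.randn(d), torch.randn(d), torch.randn(d)
M_r = torch.randn(d, d)
print(transe_score(h, r, t).item(), location_sensitive_score(h, M_r, r, t).item())
```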
This list is automatically generated from the titles and abstracts of the papers in this site.