Related papers: CAPE: Context-Adaptive Positional Encoding for Length Extrapolation

CAPE: Context-Adaptive Positional Encoding for Length Extrapolation

URL: http://arxiv.org/abs/2405.14722v1
Date: Thu, 23 May 2024 15:51:24 GMT
Title: CAPE: Context-Adaptive Positional Encoding for Length Extrapolation
Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li,
Abstract summary: Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. We propose a Context-Adaptive Positional. CAPE method, which adjusts semantically based on input context and learned priors. We successfully train the model on sequence length 128 and achieve better performance at evaluation sequence length 8192, compared with other static positional encoding methods.
Score: 60.18239094672938
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in given sequences. However, both APE and RPE remain fixed after model training regardless of input data, limiting their adaptability and flexibility. Hence, we expect that the desired positional encoding should be context-adaptive and can be dynamically adjusted with the given attention. In this paper, we propose a Context-Adaptive Positional Encoding (CAPE) method, which dynamically and semantically adjusts based on input context and learned fixed priors. Experimental validation on real-world datasets (Arxiv, Books3, and CHE) demonstrates that CAPE enhances model performances in terms of trained length and length generalization, where the improvements are statistically significant. The model visualization suggests that our model can keep both local and anti-local information. Finally, we successfully train the model on sequence length 128 and achieve better performance at evaluation sequence length 8192, compared with other static positional encoding methods, revealing the benefit of the adaptive positional encoding method.

Related papers

SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings.<n> Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM) and accuracy--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z)
Learnable Spatial-Temporal Positional Encoding for Link Prediction [44.0907827498725]
We propose a simple temporal link prediction model named L-STEP.<n>L-STEP can preserve the graph property from the spatial-temporal spectral viewpoint.<n>L-STEP obtains the leading performance in the newest large-scale TGB benchmark.
arXiv Detail & Related papers (2025-06-10T00:35:53Z)
PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations.<n>We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
Context-aware Biases for Length Extrapolation [0.0]
We propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE)<n>By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs.<n>Our method significantly enhances the performance of existing RPE methods tested on the FineWeb-Edu10B and WikiText-103 datasets.
arXiv Detail & Related papers (2025-03-11T05:54:58Z)
Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding [89.52931576290976]
Transformers rely on both content-based and position-based addressing mechanisms to make predictions. TAPE is a novel framework that enhances positional embeddings by incorporating sequence content across layers. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead.
arXiv Detail & Related papers (2025-01-01T03:23:00Z)
Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a textbfForgery-aware textbfAdaptive textbfVision textbfTransformer (FA-ViT) under the adaptive learning paradigm. FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z)
Condition-Invariant Semantic Segmentation [77.10045325743644]
We implement Condition-Invariant Semantic (CISS) on the current state-of-the-art domain adaptation architecture. Our method achieves the second-best performance on the normal-to-adverse Cityscapes$to$ACDC benchmark. CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night.
arXiv Detail & Related papers (2023-05-27T03:05:07Z)
Improving Position Encoding of Transformers for Multivariate Time Series Classification [5.467400475482668]
We propose a new absolute position encoding method dedicated to time series data called time Absolute Position. We then propose a novel time series classification (MTSC) model combining tAPE/eRPE and convolution-based input encoding named ConvTran to improve the position and data embedding of time series data.
arXiv Detail & Related papers (2023-05-26T05:30:04Z)
Adaptive Spot-Guided Transformer for Consistent Local Feature Matching [64.30749838423922]
We propose Adaptive Spot-Guided Transformer (ASTR) for local feature matching. ASTR models the local consistency and scale variations in a unified coarse-to-fine architecture.
arXiv Detail & Related papers (2023-03-29T12:28:01Z)
Transformers for End-to-End InfoSec Tasks: A Feasibility Study [6.847381178288385]
We implement transformer models for two distinct InfoSec data formats - specifically URLs and PE files. We show that our URL transformer model requires a different training approach to reach high performance levels. We demonstrate that this approach performs comparably to well-established malware detection models on benchmark PE file datasets.
arXiv Detail & Related papers (2022-12-05T23:50:46Z)
Back to the Source: Diffusion-Driven Test-Time Adaptation [77.4229736436935]
Test-time adaptation harnesses test inputs to improve accuracy of a model trained on source data when tested on shifted target data. We instead update the target data, by projecting all test inputs toward the source domain with a generative diffusion model.
arXiv Detail & Related papers (2022-07-07T17:14:10Z)
Adaptive Fine-Tuning of Transformer-Based Language Models for Named Entity Recognition [0.0]
The current standard approach for fine-tuning language models includes a fixed number of training epochs and a linear learning rate schedule. In this paper, we introduce adaptive fine-tuning, which is an alternative approach that uses early stopping and a custom learning rate schedule.
arXiv Detail & Related papers (2022-02-05T19:20:03Z)
Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation [127.6168183074427]
We propose a new task, Lifelong Unsupervised Domain Adaptive (LUDA) person ReID. This is challenging because it requires the model to continuously adapt to unlabeled data of the target environments. We design an effective scheme for this task, dubbed CLUDA-ReID, where the anti-forgetting is harmoniously coordinated with the adaptation.
arXiv Detail & Related papers (2021-12-13T13:19:45Z)
Adaptive L2 Regularization in Person Re-Identification [0.9195729979000402]
We introduce an adaptive L2 regularization mechanism in the setting of person re-identification. Experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework.
arXiv Detail & Related papers (2020-07-15T17:50:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.