Learning interpretable positional encodings in transformers depends on initialization
- URL: http://arxiv.org/abs/2406.08272v4
- Date: Mon, 23 Jun 2025 15:01:16 GMT
- Title: Learning interpretable positional encodings in transformers depends on initialization
- Authors: Takuya Ito, Luca Cocchi, Tim Klinger, Parikshit Ram, Murray Campbell, Luke Hearne
- Abstract summary: The positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. We show that the choice of initialization of a learnable PE greatly influences its ability to learn interpretable PEs. We find that a learned PE initialized from a small-norm distribution can uncover interpretable PEs that mirror ground truth positions in multiple dimensions.
- Score: 14.732076081683418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real-world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known. Here we find that the choice of initialization of a learnable PE greatly influences its ability to learn interpretable PEs that lead to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) A 2D relational reasoning task; 2) A nonlinear stochastic network simulation; 3) A real-world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions in multiple dimensions, and 2) lead to improved generalization. These results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.
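The core recipe is straightforward to reproduce. Below is a minimal PyTorch sketch (our own illustration, not the authors' released code) of a learnable PE whose initialization scale is exposed as a hyperparameter; an `init_std` on the order of 1e-3 corresponds to the small-norm regime the abstract reports as beneficial.

```python
# A hedged sketch, assuming a standard additive learnable PE; `init_std` is
# a hypothetical knob standing in for the paper's initialization choice.
import torch
import torch.nn as nn

class LearnablePE(nn.Module):
    def __init__(self, num_positions: int, d_model: int, init_std: float = 1e-3):
        super().__init__()
        # One embedding per position, drawn from N(0, init_std^2):
        # a small init_std yields the "small-norm distribution" initialization.
        self.pe = nn.Parameter(torch.randn(num_positions, d_model) * init_std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the learned encoding per position.
        return x + self.pe[: x.size(1)]

tokens = torch.randn(8, 16, 64)   # batch of 8 sequences, 16 tokens, width 64
pe_small = LearnablePE(num_positions=16, d_model=64, init_std=1e-3)
out = pe_small(tokens)            # near-identity at init; training shapes the PE
```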
Related papers
- SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
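As a concrete illustration of this idea, here is a hedged sketch (assumed details, not SeqPE's actual implementation) that spells a position index out as a fixed-length digit sequence and encodes it with a small recurrent encoder:

```python
# Illustrative only: the digit alphabet, sequence length, and GRU encoder
# are our assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class DigitPositionEncoder(nn.Module):
    def __init__(self, d_model: int = 64, max_digits: int = 6):
        super().__init__()
        self.max_digits = max_digits
        self.digit_emb = nn.Embedding(10, d_model)   # symbols are digits 0-9
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (n,) integer indices -> digit sequences, most significant first.
        digits = []
        for k in reversed(range(self.max_digits)):
            digits.append((positions // 10**k) % 10)
        seq = torch.stack(digits, dim=1)              # (n, max_digits)
        _, h = self.rnn(self.digit_emb(seq))          # final hidden state
        return h.squeeze(0)                           # (n, d_model)

enc = DigitPositionEncoder()
pe = enc(torch.arange(100))                           # encodings for positions 0..99
```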
arXiv Detail & Related papers (2025-06-16T09:16:40Z) - LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers [0.0]
Positional embeddings play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation-invariant nature of self-attention.
Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings.
We propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies.
arXiv Detail & Related papers (2025-04-19T19:20:47Z) - Learning Efficient Positional Encodings with Graph Neural Networks [109.8653020407373]
We introduce PEARL, a novel framework of learnable PEs for graphs.
PEARL approximates equivariant functions of eigenvectors with linear complexity, while rigorously establishing its stability and high expressive power.
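For orientation, the classical baseline such graph PEs build on uses the Laplacian's smallest non-trivial eigenvectors directly as positions; a minimal sketch of that generic construction (not PEARL itself) on a toy path graph:

```python
# Generic Laplacian-eigenvector positional encoding; the toy graph and the
# choice of 3 eigenvectors are illustrative assumptions.
import torch

# Adjacency of a 5-node path graph.
A = torch.zeros(5, 5)
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
deg = A.sum(dim=1)
L = torch.diag(deg) - A                  # combinatorial graph Laplacian
evals, evecs = torch.linalg.eigh(L)      # eigenpairs sorted by eigenvalue
pe = evecs[:, 1:4]                       # smallest non-trivial eigenvectors as node PEs
```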
arXiv Detail & Related papers (2025-02-03T07:28:53Z) - Multi-Surrogate-Teacher Assistance for Representation Alignment in Fingerprint-based Indoor Localization [0.5199807441687141]
We propose a plug-and-play framework for learning transferable representations among Received Signal Strength (RSS) fingerprint datasets. This work includes two main phases: Expert Training and Expert Distilling. Experiments conducted on three benchmark WiFi RSS fingerprint datasets underscore the effectiveness of the framework.
arXiv Detail & Related papers (2024-12-13T22:00:26Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
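A hedged sketch of the general idea (our own illustration, not DAPE V2's implementation): treat the per-head attention-score map as an image and refine it with a small 2D convolution before the softmax.

```python
# The residual refinement and kernel size here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

heads, seq, d_head = 4, 32, 16
q = torch.randn(1, heads, seq, d_head)
k = torch.randn(1, heads, seq, d_head)

scores = q @ k.transpose(-2, -1) / d_head**0.5   # (1, heads, seq, seq)

# Convolve over the score "image", with heads acting as channels.
conv = nn.Conv2d(heads, heads, kernel_size=3, padding=1)
refined = scores + conv(scores)                   # residual refinement of the feature map
attn = F.softmax(refined, dim=-1)
```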
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - Emergence of a High-Dimensional Abstraction Phase in Language Transformers [47.60397331657208]
A language model (LM) is a mapping from a linguistic context to an output token. We take a high-level geometric approach to its analysis, observing its behavior across five pre-trained transformer-based LMs and three input datasets. Our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.
arXiv Detail & Related papers (2024-05-24T11:49:07Z) - Exploring the Role of Token in Transformer-based Time Series Forecasting [10.081240480138487]
Transformer-based methods are a mainstream approach to time series forecasting (TSF).
Most focus on optimizing the model structure, with few studies paying attention to the role of tokens for predictions.
We find that the gradients mainly depend on tokens that contribute to the predicted series, called positive tokens.
To utilize temporal positional encoding (T-PE) and variable positional encoding (V-PE), we propose T2B-PE, a Transformer-based dual-branch framework.
arXiv Detail & Related papers (2024-04-16T07:21:39Z) - Natural Language Processing Through Transfer Learning: A Case Study on Sentiment Analysis [1.14219428942199]
This paper explores the potential of transfer learning in natural language processing focusing mainly on sentiment analysis.
The claim is that transfer learning with pre-trained BERT models can increase sentiment classification accuracy compared to training models from scratch.
arXiv Detail & Related papers (2023-11-28T17:12:06Z) - Clairvoyance: A Pipeline Toolkit for Medical Time Series [95.22483029602921]
Time-series learning is the bread and butter of data-driven clinical decision support.
Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a software toolkit.
Clairvoyance is the first to demonstrate viability of a comprehensive and automatable pipeline for clinical time-series ML.
arXiv Detail & Related papers (2023-10-28T12:08:03Z) - The Locality and Symmetry of Positional Encodings [9.246374019271938]
We conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style).
We uncover the core function of PEs by identifying two common properties, Locality and Symmetry.
We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly.
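These two properties are easy to probe directly. Below is a hedged sketch (our own construction, not the paper's probing tasks) that scores a PE matrix for locality and for directional symmetry:

```python
# The random stand-in matrix, offset k, and both metrics are illustrative
# assumptions, not the paper's definitions.
import torch

pe = torch.randn(64, 32)                      # stand-in for a trained PE matrix
pe = pe / pe.norm(dim=-1, keepdim=True)
sim = pe @ pe.T                               # cosine similarity between positions

idx = torch.arange(64)
dist = (idx[:, None] - idx[None, :]).abs().float()
# Locality: similarity should rise as distance falls (negative correlation).
locality = torch.corrcoef(torch.stack([sim.flatten(), dist.flatten()]))[0, 1]

# Symmetry: position i should be about equally similar to i+k and i-k.
k = 5
mid = torch.arange(k, 64 - k)
symmetry_gap = (sim[mid, mid + k] - sim[mid, mid - k]).abs().mean()
print(f"distance-similarity corr={locality:.3f}, symmetry gap={symmetry_gap:.3f}")
```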
arXiv Detail & Related papers (2023-10-19T16:15:15Z) - SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations [76.45009891152178]
The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets and tasks.
We show, for the first time, that general representations learning can be achieved through the task of occupancy prediction.
Our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
arXiv Detail & Related papers (2023-09-19T11:13:01Z) - The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
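For reference, ALiBi (one of the schemes compared above) injects position purely as an additive attention bias. A minimal sketch of the standard construction follows (slope schedule simplified; not this paper's code):

```python
# The geometric slope schedule below matches ALiBi's published 2^(-8i/n)
# schedule only when num_heads == 8; treat it as an illustrative choice.
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    slopes = torch.tensor([2.0 ** -(i + 1) for i in range(num_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).float()        # i - j: distance to each key
    # Penalty grows linearly with distance; entries with j > i are
    # removed by the causal mask in practice.
    return -slopes[:, None, None] * dist                # (heads, seq, seq)

scores = torch.randn(8, 128, 128) + alibi_bias(8, 128)  # add before the softmax
```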
arXiv Detail & Related papers (2023-05-31T00:29:55Z) - SemiGNN-PPI: Self-Ensembling Multi-Graph Neural Network for Efficient and Generalizable Protein-Protein Interaction Prediction [16.203794286288815]
Protein-protein interactions (PPIs) are crucial in various biological processes and their study has significant implications for drug development and disease diagnosis.
Existing deep learning methods suffer from significant performance degradation under complex real-world scenarios.
We propose a self-ensembling multigraph neural network (SemiGNN-PPI) that can effectively predict PPIs while being both efficient and generalizable.
arXiv Detail & Related papers (2023-05-15T03:06:44Z) - A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z) - An Extension to Basis-Hypervectors for Learning from Circular Data in Hyperdimensional Computing [62.997667081978825]
Hyperdimensional Computing (HDC) is a computation framework based on properties of high-dimensional random spaces.
We present a study on basis-hypervector sets, which leads to practical contributions to HDC in general.
We introduce a method to learn from circular data, an important type of information never before addressed in machine learning with HDC.
arXiv Detail & Related papers (2022-05-16T18:04:55Z) - PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer [55.936527926778695]
Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields.
In this paper, we propose the PhysFormer, an end-to-end video transformer based architecture.
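A hedged sketch of the temporal-difference idea behind such architectures (a generic first-order difference feature; not PhysFormer's actual module):

```python
# Illustrative only: differencing adjacent frames emphasizes frame-to-frame
# change, where subtle physiological (rPPG) signals live.
import torch

def temporal_difference(frames: torch.Tensor) -> torch.Tensor:
    # frames: (batch, time, channels, H, W)
    return frames[:, 1:] - frames[:, :-1]

clip = torch.randn(2, 16, 3, 64, 64)
diff = temporal_difference(clip)   # (2, 15, 3, 64, 64)
```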
arXiv Detail & Related papers (2021-11-23T18:57:11Z) - Measuring Generalization with Optimal Transport [111.29415509046886]
We develop margin-based generalization bounds, where the margins are normalized with optimal transport costs.
Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets.
arXiv Detail & Related papers (2021-06-07T03:04:59Z) - More data or more parameters? Investigating the effect of data structure on generalization [17.249712222764085]
Properties of the data impact the test error as a function of the number of training examples and the number of parameters.
We show that noise in the labels and strong anisotropy of the input data play similar roles on the test error.
arXiv Detail & Related papers (2021-03-09T16:08:41Z) - PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a Prior-Guided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
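A hedged sketch of a region-wise local-consistency objective in the spirit described above (a generic construction, not PGL's exact loss):

```python
# Illustrative assumptions: two spatially aligned augmented views, 2x2x2
# region pooling, and a cosine consistency loss between matching regions.
import torch
import torch.nn.functional as F

f1 = torch.randn(2, 32, 8, 8, 8)   # features of view 1: (B, C, D, H, W)
f2 = torch.randn(2, 32, 8, 8, 8)   # features of view 2, spatially aligned

# Pool each non-overlapping region and pull corresponding regions together.
r1 = F.avg_pool3d(f1, 2).flatten(2)          # (B, C, regions)
r2 = F.avg_pool3d(f2, 2).flatten(2)
loss = 1 - F.cosine_similarity(r1, r2, dim=1).mean()
```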
arXiv Detail & Related papers (2020-11-25T11:03:11Z) - PyraPose: Feature Pyramids for Fast and Accurate Object Pose Estimation under Domain Shift [26.037061005620263]
We argue that patch-based approaches, instead of encoder-decoder networks, are more suited for synthetic-to-real transfer.
We present a novel approach based on a specialized feature pyramid network to compute multi-scale features for creating pose hypotheses.
Our single-shot pose estimation approach is evaluated on multiple standard datasets and outperforms the state of the art by up to 35%.
arXiv Detail & Related papers (2020-10-30T08:26:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.