Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features
- URL: http://arxiv.org/abs/2509.16629v2
- Date: Tue, 23 Sep 2025 17:52:27 GMT
- Title: Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features
- Authors: Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun
- Abstract summary: CAPE is a novel method that identifies the underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG). The DAG is embedded in hyperbolic space, where its geometric structure is well-preserved, using a hyperboloid model-based approach. This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integration with the transformer's self-attention mechanism.
- Score: 2.945172427769856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Positional encoding is essential for supplementing transformers with positional information about tokens. Existing positional encoding methods demand a predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies the underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space, where its geometric structure is well-preserved, using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength & causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integration with the transformer's self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention: causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE on both synthetic and real-world datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformers for data with non-sequential features. Our code is available at https://github.com/Catchxu/CAPE.
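The abstract outlines a three-step pipeline: causal DAG discovery via generalized structural equation modeling, a hyperboloid-model embedding of the DAG, and conversion of the resulting feature positions into rotary encodings for self-attention. The sketch below illustrates only the final rotary-integration step under strong simplifying assumptions: the learned hyperbolic embedding is replaced by a toy weighted causal-depth heuristic, and the DAG, dimensions, and function names are illustrative rather than taken from the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch (not the authors' implementation): derive scalar
# "causal positions" from a toy weighted DAG, then inject them into
# self-attention via standard rotary position embeddings (RoPE).
import numpy as np

rng = np.random.default_rng(0)

# Toy weighted DAG over 4 non-sequential features (A[i, j] > 0 means i -> j).
A = np.array([
    [0.0, 0.8, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
])
n_feat, d = A.shape[0], 8  # d must be even for the pairwise rotations

def causal_depth(adj):
    """Stand-in for the hyperboloid embedding step: weighted causal depth."""
    depth = np.zeros(adj.shape[0])
    for _ in range(adj.shape[0]):            # DAG: n passes suffice
        for i, j in zip(*np.nonzero(adj)):
            depth[j] = max(depth[j], depth[i] + adj[i, j])
    return depth

pos = causal_depth(A)                         # causality-aware scalar positions

def rope(x, p, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles p * base^(-2k/d)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) * 2.0 / x.shape[-1])
    ang = p[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q, k = rng.normal(size=(n_feat, d)), rng.normal(size=(n_feat, d))
scores = rope(q, pos) @ rope(k, pos).T / np.sqrt(d)            # logits now depend
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # on causal positions
print(attn.round(3))
```

In this simplified form, each attention logit depends on the two features' positions only through their rotary angles; CAPE's actual encodings are derived from the learned hyperbolic embedding rather than a depth heuristic.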
Related papers
- Features Emerge as Discrete States: The First Application of SAEs to 3D Representations [5.751184796461698]
Sparse Autoencoders (SAEs) are a powerful dictionary learning technique for decomposing neural network activations. We present the first application of SAEs to the 3D domain, analyzing the features used by a state-of-the-art 3D reconstruction VAE applied to 53k 3D models.
arXiv Detail & Related papers (2025-12-12T03:54:45Z) - Learnable Spatial-Temporal Positional Encoding for Link Prediction [44.0907827498725]
We propose a simple temporal link prediction model named L-STEP. L-STEP preserves graph properties from a spatial-temporal spectral viewpoint and obtains leading performance on the latest large-scale TGB benchmark.
arXiv Detail & Related papers (2025-06-10T00:35:53Z) - Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability [53.21677928601684]
Layer-wise relevance propagation is one of the most promising approaches to explainability in deep learning. We propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods. Our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks.
arXiv Detail & Related papers (2025-06-02T18:07:55Z) - Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Least Squares regression. We derive upper bounds on the empirical error which, in the regime, decay at a provable rate of $\mathcal{O}(t^{-1/2d})$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z) - On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent [51.50999191584981]
Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. We study how SignGD optimizes a two-layer transformer on a noisy dataset. We find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam require high-quality data for real-world tasks (a minimal sketch of the sign-descent update appears after this list).
arXiv Detail & Related papers (2024-10-07T09:36:43Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding [0.0]
In Transformer-based architectures, the attention mechanism is inherently permutation-invariant with respect to the input sequence's tokens.
We introduce Hyperbolic Positional Attention (HyPE), a novel method that utilizes hyperbolic functions' properties to encode tokens' relative positions.
arXiv Detail & Related papers (2023-10-30T15:54:32Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - Fundamental Limits of Two-layer Autoencoders, and Achieving Them with
Gradient Methods [91.54785981649228]
This paper focuses on non-linear two-layer autoencoders trained in the challenging proportional regime.
Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods.
For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders.
arXiv Detail & Related papers (2022-12-27T12:37:34Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Stochastic tensor space feature theory with applications to robust machine learning [3.6891975755608355]
We develop a Multilevel Orthogonal Subspace (MOS) Karhunen-Loève feature theory based on tensor spaces. Our key observation is that separate machine learning classes can reside predominantly in mostly distinct subspaces. Tests on the blood plasma dataset (Alzheimer's Disease Neuroimaging Initiative) show dramatic increases in accuracy.
arXiv Detail & Related papers (2021-10-04T22:01:01Z)
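For the SignGD entry above, the following is a minimal, generic sketch of the sign gradient descent update it refers to, shown on a toy least-squares problem; the setup, learning rate, and variable names are illustrative assumptions and are not taken from that paper.

```python
# Generic sign gradient descent (SignGD) on a toy least-squares objective;
# illustrative only, not the cited paper's two-layer transformer setting.
import numpy as np

rng = np.random.default_rng(1)
X, w_true = rng.normal(size=(128, 5)), rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=128)

w, lr = np.zeros(5), 0.01
for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    w -= lr * np.sign(grad)                # SignGD: step by the gradient's sign only
print(np.round(w - w_true, 2))             # residual error after training
```

Unlike plain gradient descent, the step size here is the same for every coordinate regardless of gradient magnitude, which is the sense in which SignGD acts as a simple surrogate for Adam's per-coordinate normalization.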