Aligning Transformers with Weisfeiler-Leman
- URL: http://arxiv.org/abs/2406.03148v1
- Date: Wed, 5 Jun 2024 11:06:33 GMT
- Title: Aligning Transformers with Weisfeiler-Leman
- Authors: Luis Müller, Christopher Morris
- Abstract summary: Graph neural network architectures aligned with the $k$-WL hierarchy offer theoretically well-understood expressive power.
We develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE.
We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art.
- Score: 5.0452971570315235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graph neural network architectures aligned with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the $k$-WL hierarchy have shown promising empirical results, employing transformers for higher orders of $k$ remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the $k$-WL hierarchy, showing stronger expressivity results for each $k$, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at https://github.com/luis-mueller/wl-transformers.
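Since the abstract highlights Laplacian PEs among the positional encodings studied, the following is a minimal, generic sketch of how Laplacian PEs are commonly computed: eigenvectors of the normalized graph Laplacian attached to the nodes as extra features. It is not taken from the paper's repository; the function name `laplacian_pe` and the dimension `k` are illustrative choices.

```python
# Minimal sketch of Laplacian positional encodings (PEs), assuming the standard
# construction: eigenvectors of the normalized graph Laplacian used as extra
# node features. Not the paper's implementation; names and defaults are illustrative.
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int = 8) -> np.ndarray:
    """Return, per node, the entries of the k eigenvectors of
    L = I - D^{-1/2} A D^{-1/2} with the smallest nonzero eigenvalues."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    pe = eigvecs[:, 1:k + 1]                # drop the eigenvector for eigenvalue 0
    # Eigenvectors are only defined up to sign; sign-invariant encodings such as
    # SPE (mentioned in the abstract) are one way of handling this ambiguity.
    return pe

# Example: Laplacian PEs for a 6-cycle
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
print(laplacian_pe(A, k=4).shape)  # (6, 4)
```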
Related papers
- Circuit Complexity Bounds for RoPE-based Transformer Architecture [25.2590541420499]
Empirical evidence suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities than conventional Transformer models.
We show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem.
arXiv Detail & Related papers (2024-11-12T07:24:41Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e^{-L})$ in Wasserstein distance, where $L$ is the number of layers.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers (a minimal sketch of add-$\lambda$ smoothing appears after this list).
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Automatic Graph Topology-Aware Transformer [50.2807041149784]
We build a comprehensive graph Transformer search space with micro-level and macro-level designs.
EGTAS evolves graph Transformer topologies at the macro level and graph-aware strategies at the micro level.
We demonstrate the efficacy of EGTAS across a range of graph-level and node-level tasks.
arXiv Detail & Related papers (2024-05-30T07:44:31Z) - On the Theoretical Expressive Power and the Design Space of Higher-Order Graph Transformers [20.73012427295352]
We study the theoretical expressive power of order-$k$ graph transformers and sparse variants.
We show that a natural neighborhood-based sparse order-$k$ transformer model is not only computationally efficient, but also expressive.
arXiv Detail & Related papers (2024-04-04T11:26:51Z) - Towards Principled Graph Transformers [8.897857788525629]
Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer theoretically well-understood expressive power.
We show that the proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power.
arXiv Detail & Related papers (2024-01-18T16:50:55Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
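As referenced in the entry "Can Transformers Learn $n$-gram Language Models?" above, here is a minimal sketch of add-$\lambda$ (Lidstone) smoothing, the classic $n$-gram estimation technique that entry names as a baseline. Everything in it (the function name, toy corpus, and the value of $\lambda$) is illustrative and not taken from that paper.

```python
# Minimal sketch of add-lambda (Lidstone) smoothing for an n-gram language model:
# P(w | h) = (count(h, w) + lam) / (count(h) + lam * |V|).
# Names and the toy corpus are illustrative only.
from collections import Counter

def addlambda_ngram_probs(tokens, n=2, lam=0.1):
    """Return a function prob(word, history) estimated with add-lambda smoothing."""
    vocab = set(tokens)
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hist_counts = Counter()
    for ngram, c in ngram_counts.items():
        hist_counts[ngram[:-1]] += c  # count(h) = sum over w of count(h, w)

    def prob(word, history):
        h = tuple(history[-(n - 1):]) if n > 1 else ()
        return (ngram_counts[h + (word,)] + lam) / (hist_counts[h] + lam * len(vocab))

    return prob

# Toy usage
corpus = "the cat sat on the mat the cat ran".split()
p = addlambda_ngram_probs(corpus, n=2, lam=0.1)
print(round(p("cat", ["the"]), 3))  # ~0.583: "the cat" occurs twice after "the"
print(round(p("ran", ["the"]), 3))  # ~0.028: unseen bigram, nonzero due to smoothing
```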