Aligning Transformers with Weisfeiler-Leman
- URL: http://arxiv.org/abs/2406.03148v1
- Date: Wed, 5 Jun 2024 11:06:33 GMT
- Title: Aligning Transformers with Weisfeiler-Leman
- Authors: Luis Müller, Christopher Morris
- Abstract summary: Graph neural network architectures aligned with the $k$-WL hierarchy offer theoretically well-understood expressive power.
We develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE.
We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art.
- Score: 5.0452971570315235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graph neural network architectures aligned with the $k$-dimensional Weisfeiler--Leman ($k$-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the $k$-WL hierarchy have shown promising empirical results, employing transformers for higher orders of $k$ remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the $k$-WL hierarchy, showing stronger expressivity results for each $k$, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at https://github.com/luis-mueller/wl-transformers.
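Since the abstract highlights Laplacian PEs among the positional encodings studied, the following is a minimal, generic sketch of how Laplacian PEs are commonly computed: eigenvectors of the normalized graph Laplacian attached to the nodes as extra features. It is not taken from the paper's repository; the function name `laplacian_pe` and the dimension `k` are illustrative choices.

```python
# Minimal sketch of Laplacian positional encodings (PEs), assuming the standard
# construction: eigenvectors of the normalized graph Laplacian used as extra
# node features. Not the paper's implementation; names and defaults are illustrative.
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int = 8) -> np.ndarray:
    """Return, per node, the entries of the k eigenvectors of
    L = I - D^{-1/2} A D^{-1/2} with the smallest nonzero eigenvalues."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    pe = eigvecs[:, 1:k + 1]                # drop the eigenvector for eigenvalue 0
    # Eigenvectors are only defined up to sign; sign-invariant encodings such as
    # SPE (mentioned in the abstract) are one way of handling this ambiguity.
    return pe

# Example: Laplacian PEs for a 6-cycle
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
print(laplacian_pe(A, k=4).shape)  # (6, 4)
```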
Related papers
- Circuit Complexity Bounds for RoPE-based Transformer Architecture [25.2590541420499]
Empirical evidence suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities than conventional Transformer models.
We show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem.
arXiv Detail & Related papers (2024-11-12T07:24:41Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e^{-L})$ in Wasserstein distance, where $L$ is the number of layers.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers (a minimal sketch of add-$\lambda$ smoothing appears after this list).
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Automatic Graph Topology-Aware Transformer [50.2807041149784]
We build a comprehensive graph Transformer search space with micro-level and macro-level designs.
EGTAS evolves graph Transformer topologies at the macro level and graph-aware strategies at the micro level.
We demonstrate the efficacy of EGTAS across a range of graph-level and node-level tasks.
arXiv Detail & Related papers (2024-05-30T07:44:31Z) - On the Theoretical Expressive Power and the Design Space of Higher-Order Graph Transformers [20.73012427295352]
We study the theoretical expressive power of order-$k$ graph transformers and sparse variants.
We show that a natural neighborhood-based sparse order-$k$ transformer model is not only computationally efficient, but also expressive.
arXiv Detail & Related papers (2024-04-04T11:26:51Z) - Towards Principled Graph Transformers [8.897857788525629]
Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer theoretically well-understood expressive power.
We show that the proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power.
arXiv Detail & Related papers (2024-01-18T16:50:55Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
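As referenced in the entry "Can Transformers Learn $n$-gram Language Models?" above, here is a minimal sketch of add-$\lambda$ (Lidstone) smoothing, the classic $n$-gram estimation technique that entry names as a baseline. Everything in it (the function name, toy corpus, and the value of $\lambda$) is illustrative and not taken from that paper.

```python
# Minimal sketch of add-lambda (Lidstone) smoothing for an n-gram language model:
# P(w | h) = (count(h, w) + lam) / (count(h) + lam * |V|).
# Names and the toy corpus are illustrative only.
from collections import Counter

def addlambda_ngram_probs(tokens, n=2, lam=0.1):
    """Return a function prob(word, history) estimated with add-lambda smoothing."""
    vocab = set(tokens)
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hist_counts = Counter()
    for ngram, c in ngram_counts.items():
        hist_counts[ngram[:-1]] += c  # count(h) = sum over w of count(h, w)

    def prob(word, history):
        h = tuple(history[-(n - 1):]) if n > 1 else ()
        return (ngram_counts[h + (word,)] + lam) / (hist_counts[h] + lam * len(vocab))

    return prob

# Toy usage
corpus = "the cat sat on the mat the cat ran".split()
p = addlambda_ngram_probs(corpus, n=2, lam=0.1)
print(round(p("cat", ["the"]), 3))  # ~0.583: "the cat" occurs twice after "the"
print(round(p("ran", ["the"]), 3))  # ~0.028: unseen bigram, nonzero due to smoothing
```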