Transformers Generalize DeepSets and Can be Extended to Graphs and
Hypergraphs
- URL: http://arxiv.org/abs/2110.14416v1
- Date: Wed, 27 Oct 2021 13:20:05 GMT
- Title: Transformers Generalize DeepSets and Can be Extended to Graphs and
Hypergraphs
- Authors: Jinwoo Kim, Saeyoon Oh, Seunghoon Hong
- Abstract summary: We present a generalization of Transformers to any-order permutation invariant data (sets, graphs, and hypergraphs).
In particular, we show that the sparse second-order Transformers with kernel attention are theoretically more expressive than message passing operations.
Our models achieve significant improvement over invariant MLPs and message-passing graph neural networks in large-scale graph regression and set-to-(hyper)graph prediction tasks.
- Score: 15.844680924751984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a generalization of Transformers to any-order permutation
invariant data (sets, graphs, and hypergraphs). We begin by observing that
Transformers generalize DeepSets, or first-order (set-input) permutation
invariant MLPs. Then, based on recently characterized higher-order invariant
MLPs, we extend the concept of self-attention to higher orders and propose
higher-order Transformers for order-$k$ data ($k=2$ for graphs and $k>2$ for
hypergraphs). Unfortunately, higher-order Transformers turn out to have
prohibitive $\mathcal{O}(n^{2k})$ complexity in the number of input nodes $n$.
To address this problem, we present sparse higher-order Transformers whose
complexity is quadratic in the number of input hyperedges, and further adopt the
kernel attention approach to reduce the complexity to linear. In particular, we
show that the sparse second-order Transformers with kernel attention are
theoretically more expressive than message passing operations while having an
asymptotically identical complexity. Our models achieve significant performance
improvement over invariant MLPs and message-passing graph neural networks in
large-scale graph regression and set-to-(hyper)graph prediction tasks. Our
implementation is available at https://github.com/jw9730/hot.
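The first observation above (Transformers generalize DeepSets) can be made concrete with a small sketch: when the attention weights are uniform, single-head self-attention reduces to the mean-pooling term of a first-order permutation-equivariant DeepSets layer. The PyTorch snippet below is my own minimal illustration with hypothetical helper names (deepsets_layer, self_attention); it is not code from the linked repository.

```python
import torch

def deepsets_layer(X, W1, W2):
    # First-order permutation-equivariant (DeepSets) layer:
    # f(X)_i = X_i W1 + (mean_j X_j) W2   (mean pooling, to match uniform attention)
    return X @ W1 + X.mean(dim=0, keepdim=True) @ W2

def self_attention(X, Wq, Wk, Wv):
    # Plain single-head self-attention over a set of n elements.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # (n, n) attention weights
    return A @ V

# Illustrative check: with zero query/key projections the attention weights are
# uniform (1/n), and the attention output equals the DeepSets pooling term.
n, d = 8, 16
X, Wv = torch.randn(n, d), torch.randn(d, d)
out_attn = self_attention(X, torch.zeros(d, d), torch.zeros(d, d), Wv)
out_pool = deepsets_layer(X, torch.zeros(d, d), Wv)
print(torch.allclose(out_attn, out_pool, atol=1e-5))  # True
```

The higher-order models in the paper extend this pattern from first-order (node) inputs to order-$k$ hyperedge inputs.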
Related papers
- Learning Graph Quantized Tokenizers for Transformers [28.79505338383552]
Graph Transformers (GTs) have emerged as a leading model in deep learning, outperforming Graph Neural Networks (GNNs) in various graph learning tasks.
We introduce GQT (Graph Quantized Tokenizer), which decouples tokenizer training from Transformer training by leveraging graph self-supervised learning.
By combining the GQT with token modulation, a Transformer encoder achieves state-of-the-art performance on 16 out of 18 benchmarks, including large-scale homophilic and heterophilic datasets.
arXiv Detail & Related papers (2024-10-17T17:38:24Z)
- Masked Graph Transformer for Large-Scale Recommendation [56.37903431721977]
We propose an efficient Masked Graph Transformer, named MGFormer, capable of capturing all-pair interactions among nodes with a linear complexity.
Experimental results show the superior performance of our MGFormer, even with a single attention layer.
arXiv Detail & Related papers (2024-05-07T06:00:47Z)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
- Pure Transformers are Powerful Graph Learners [51.36884247453605]
We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice.
We prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers.
Our method, coined Tokenized Graph Transformer (TokenGT), achieves significantly better results than GNN baselines and competitive results against Transformer variants with sophisticated graph-specific inductive biases.
arXiv Detail & Related papers (2022-07-06T08:13:06Z)
- What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate the effectiveness of our graph function.
arXiv Detail & Related papers (2022-05-27T14:36:55Z)
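As a rough illustration of what a hypercube-structured interaction pattern looks like (my own sketch, under the assumption that tokens sit on hypercube vertices and attend along hypercube edges; not necessarily the paper's exact construction):

```python
import torch

def hypercube_attention_mask(num_tokens):
    # Boolean mask: token i may attend to token j iff i == j or the binary
    # indices of i and j differ in exactly one bit (a hypercube graph), so each
    # of the n tokens has only O(log n) neighbours. Assumes num_tokens is a
    # power of two. Illustration only, not the paper's released code.
    idx = torch.arange(num_tokens)
    diff = idx.unsqueeze(0) ^ idx.unsqueeze(1)     # pairwise XOR of token indices
    return (diff & (diff - 1)) == 0                # zero or a power of two

mask = hypercube_attention_mask(16)
print(mask.sum(dim=-1))  # every token sees itself plus log2(16) = 4 neighbours
```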
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
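To unpack the FFT trick in the entry above: a relative-positional-bias matrix B with B[i, j] = b_{i-j} is Toeplitz, and a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a circulant one and diagonalizing it with the FFT. The sketch below shows only this core primitive under hypothetical naming (toeplitz_matvec_fft); it is not the paper's full kernelized-attention algorithm.

```python
import torch

def toeplitz_matvec_fft(t_pos, t_neg, x):
    # Multiply a Toeplitz matrix T (T[i, j] = t_{i-j}) by a vector x in O(n log n)
    # via a 2n x 2n circulant embedding. t_pos[k] = t_k for k = 0..n-1 (diagonal
    # and below); t_neg[k] = t_{-(k+1)} for k = 0..n-2 (above the diagonal).
    n = x.shape[0]
    c = torch.zeros(2 * n, dtype=x.dtype)
    c[:n] = t_pos                                  # t_0 .. t_{n-1}
    c[n + 1:] = t_neg.flip(0)                      # t_{-(n-1)} .. t_{-1}
    v = torch.cat([x, torch.zeros(n, dtype=x.dtype)])
    y = torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(v)).real
    return y[:n]

# Quick check against an explicitly built dense Toeplitz matrix.
n = 6
t_pos, t_neg, x = torch.randn(n), torch.randn(n - 1), torch.randn(n)
T = torch.stack([torch.stack([t_pos[i - j] if i >= j else t_neg[j - i - 1]
                              for j in range(n)]) for i in range(n)])
print(torch.allclose(T @ x, toeplitz_matvec_fft(t_pos, t_neg, x), atol=1e-5))  # True
```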
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
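The associativity trick described in the entry above can be sketched in a few lines: with a positive feature map phi, (phi(Q) phi(K)^T) V is regrouped as phi(Q) (phi(K)^T V), so no N x N attention matrix is ever formed. This is my own simplified, non-causal illustration using the elu(x)+1 feature map from that line of work, not the authors' released implementation.

```python
import torch

def linear_attention(Q, K, V, eps=1e-6):
    # Non-causal linear attention: complexity O(N * d * d_v) instead of O(N^2 * d_v).
    # Q, K: (batch, N, d); V: (batch, N, d_v).
    phi = lambda x: torch.nn.functional.elu(x) + 1            # positive feature map
    Qp, Kp = phi(Q), phi(K)
    KV = torch.einsum('bnd,bne->bde', Kp, V)                  # phi(K)^T V, shape (batch, d, d_v)
    Z = 1.0 / (torch.einsum('bnd,bd->bn', Qp, Kp.sum(dim=1)) + eps)  # row-wise normalizer
    return torch.einsum('bnd,bde,bn->bne', Qp, KV, Z)

Q = torch.randn(2, 1024, 64)
K = torch.randn(2, 1024, 64)
V = torch.randn(2, 1024, 64)
out = linear_attention(Q, K, V)   # (2, 1024, 64); no 1024 x 1024 matrix is materialized
```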
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
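For intuition about what "$O(n)$ connections per attention layer" means, the sketch below builds one common sparse pattern (a fixed local window plus a single global token); this is my own illustrative pattern, not necessarily the specific sparsity patterns analysed or benchmarked in that paper.

```python
import torch

def sparse_attention_mask(n, window=2):
    # Boolean mask with O(n) nonzero entries: each token attends to a fixed-width
    # local window, and token 0 acts as a global token attending to / attended by
    # all tokens. Illustration of O(n) connections, not the paper's exact patterns.
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    local = (i - j).abs() <= window            # sliding window of width 2*window + 1
    global_tok = (i == 0) | (j == 0)           # one global token
    return local | global_tok

mask = sparse_attention_mask(1024)
print(mask.sum().item())   # roughly 7,000 connections, far below 1024**2 ~ 1e6
```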
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.