From block-Toeplitz matrices to differential equations on graphs:
towards a general theory for scalable masked Transformers
- URL: http://arxiv.org/abs/2107.07999v8
- Date: Tue, 28 Mar 2023 03:50:33 GMT
- Title: From block-Toeplitz matrices to differential equations on graphs:
towards a general theory for scalable masked Transformers
- Authors: Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit
Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian
Weller, Thomas Weingarten
- Abstract summary: We show that recent results on linear causal attention are special cases of this general mechanism.
By casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results.
- Score: 44.074479731587786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we provide, to the best of our knowledge, the first
comprehensive approach for incorporating various masking mechanisms into
Transformer architectures in a scalable way. We show that recent results on
linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention
(Luo et al., 2021) are special cases of this general mechanism. However, by
casting the problem as a topological (graph-based) modulation of unmasked
attention, we obtain several previously unknown results, including efficient
d-dimensional RPE-masking and graph-kernel masking. We leverage many
mathematical techniques ranging from spectral analysis through dynamic
programming and random walks to new algorithms for solving Markov processes on
graphs. We provide a corresponding empirical evaluation.
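To make the scalable-masking idea above concrete, the following is a minimal NumPy sketch of the linear causal attention special case mentioned in the abstract, not the authors' implementation: the lower-triangular causal mask is never materialised but is applied implicitly through prefix sums over feature-mapped keys and values, giving time linear in sequence length. The elu-based feature map and the stabilising epsilon are illustrative assumptions.
```python
import numpy as np

def feature_map(x):
    # Illustrative positive feature map (elu(x) + 1); random-feature maps
    # would typically be used in practice.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal (lower-triangular masked) linear attention via prefix sums.

    Q, K: (L, d) queries/keys; V: (L, d_v) values.
    Runs in O(L * d * d_v) time instead of the O(L^2) of explicit masking.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    d, d_v = Qf.shape[1], V.shape[1]
    S = np.zeros((d, d_v))            # running sum of k_t v_t^T for positions <= t
    z = np.zeros(d)                   # running sum of k_t (normaliser)
    out = np.zeros((len(V), d_v))
    for t in range(len(V)):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + eps)
    return out

# Example: L=8 tokens, d=4 query/key dims, d_v=4 value dims.
rng = np.random.default_rng(0)
y = causal_linear_attention(rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 4)))
print(y.shape)  # (8, 4)
```
Unmasked linear attention corresponds to replacing the running sums with totals over all positions; the more general masks discussed in the paper (d-dimensional RPE, graph kernels) modulate this computation through graph structure rather than through the causal chain.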
Related papers
- Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition [0.0]
Existing sequence-to-sequence models for structured language tasks rely heavily on the dot-product self-attention mechanism.
We introduce the Graph Wavelet Transformer (GWT), a novel architecture that replaces this bottleneck with a learnable, multi-scale wavelet transform.
arXiv Detail & Related papers (2025-05-09T00:13:23Z)
- Neural Algorithmic Reasoning for Hypergraphs with Looped Transformers [22.641550077885686]
Looped Transformers have shown exceptional neural algorithmic reasoning capability in traditional graph algorithms.
We extend the Looped Transformer architecture's neural algorithmic reasoning capability to simulate hypergraph algorithms.
arXiv Detail & Related papers (2025-01-18T07:58:45Z)
- Convergence Guarantees for the DeepWalk Embedding on Block Models [9.898607871253775]
We show how to use the DeepWalk algorithm on graphs obtained from the Stochastic Block Model (SBM).
Despite being simplistic, the SBM has proved to be a classic model for analyzing algorithms on large graphs.
arXiv Detail & Related papers (2024-10-26T18:35:11Z)
- Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection [6.759148939470332]
We show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective.
arXiv Detail & Related papers (2024-10-18T16:32:06Z)
- Linear Transformer Topological Masking with Graph Random Features [52.717865653036796]
We show how to parameterise topological masks as a learnable function of a weighted adjacency matrix.
Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data.
arXiv Detail & Related papers (2024-10-04T14:24:06Z)
- Quantum Maximum Entropy Inference and Hamiltonian Learning [4.9614587340495]
This work extends algorithms for maximum entropy inference and learning of graphical models to the quantum realm.
The generalization, known as quantum iterative scaling (QIS), is straightforward, but the key challenge lies in the non-commutative nature of quantum problem instances.
We explore quasi-Newton methods to enhance the performance of QIS and gradient descent (GD).
arXiv Detail & Related papers (2024-07-16T08:11:34Z)
- Discrete Graph Auto-Encoder [52.50288418639075]
We introduce a new framework named Discrete Graph Auto-Encoder (DGAE).
We first use a permutation-equivariant auto-encoder to convert graphs into sets of discrete latent node representations.
In the second step, we sort the sets of discrete latent representations and learn their distribution with a specifically designed auto-regressive model.
arXiv Detail & Related papers (2023-06-13T12:40:39Z)
- Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of a stochastic optimization algorithm can be bounded in terms of the 'complexity' of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z)
- Joint Network Topology Inference via Structured Fusion Regularization [70.30364652829164]
Joint network topology inference represents a canonical problem of learning multiple graph Laplacian matrices from heterogeneous graph signals.
We propose a general graph estimator based on a novel structured fusion regularization.
We show that the proposed graph estimator enjoys both high computational efficiency and rigorous theoretical guarantees.
arXiv Detail & Related papers (2021-03-05T04:42:32Z)
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Scaling Graph Clustering with Distributed Sketches [1.1011268090482575]
We present a method inspired by spectral clustering that instead uses matrix sketches derived from random dimension-reducing projections.
We show that our method produces embeddings that yield performant clustering results given a fully-dynamic block model stream.
We also discuss the effects of block model parameters on the required dimensionality of the subsequent embeddings, and show how random projections could significantly improve the performance of graph clustering in distributed memory (see the illustrative sketch after this list).
arXiv Detail & Related papers (2020-07-24T17:38:04Z)
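As a small illustration of the last entry's idea, and not that paper's distributed algorithm, the NumPy sketch below replaces the eigenvector embedding of spectral clustering with a random dimension-reducing projection propagated over the graph's random-walk matrix; the function name, parameters, and the toy block-model generator are assumptions made for the example.
```python
import numpy as np

def sketch_node_embedding(adj, dim=32, steps=2, seed=0):
    """Node embeddings from a random dimension-reducing projection of the
    random-walk matrix, in the spirit of sketch-based spectral clustering.

    adj: (n, n) symmetric adjacency matrix (dense, for illustration only).
    Returns an (n, dim) embedding that can be fed to any k-means routine.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    deg = np.maximum(adj.sum(axis=1), 1.0)
    P = adj / deg[:, None]                                # row-stochastic random-walk matrix
    emb = rng.standard_normal((n, dim)) / np.sqrt(dim)    # Gaussian sketch
    for _ in range(steps):                                # propagate: P^steps @ sketch
        emb = P @ emb
    return emb

# Toy two-block stochastic block model and its sketch embedding.
rng = np.random.default_rng(1)
n, p_in, p_out = 60, 0.3, 0.02
blocks = np.repeat([0, 1], n // 2)
probs = np.where(blocks[:, None] == blocks[None, :], p_in, p_out)
A = np.triu(rng.random((n, n)) < probs, k=1).astype(float)
A = A + A.T
print(sketch_node_embedding(A).shape)  # (60, 32)
```
The rows of the returned embedding can be handed to any off-the-shelf k-means routine; the cited paper further distributes the sketch computation and analyses how the block-model parameters determine the required embedding dimension.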
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.