From block-Toeplitz matrices to differential equations on graphs:
towards a general theory for scalable masked Transformers
- URL: http://arxiv.org/abs/2107.07999v8
- Date: Tue, 28 Mar 2023 03:50:33 GMT
- Title: From block-Toeplitz matrices to differential equations on graphs:
towards a general theory for scalable masked Transformers
- Authors: Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit
Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian
Weller, Thomas Weingarten
- Abstract summary: We show that recent results on linear causal attention are special cases of this general mechanism.
By casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results.
- Score: 44.074479731587786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we provide, to the best of our knowledge, the first
comprehensive approach for incorporating various masking mechanisms into
Transformer architectures in a scalable way. We show that recent results on
linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention
(Luo et al., 2021) are special cases of this general mechanism. However, by
casting the problem as a topological (graph-based) modulation of unmasked
attention, we obtain several previously unknown results, including efficient
d-dimensional RPE-masking and graph-kernel masking. We leverage many
mathematical techniques ranging from spectral analysis through dynamic
programming and random walks to new algorithms for solving Markov processes on
graphs. We provide a corresponding empirical evaluation.
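To make the scalable-masking idea above concrete, the following is a minimal NumPy sketch of the linear causal attention special case mentioned in the abstract, not the authors' implementation: the lower-triangular causal mask is never materialised but is applied implicitly through prefix sums over feature-mapped keys and values, giving time linear in sequence length. The elu-based feature map and the stabilising epsilon are illustrative assumptions.
```python
import numpy as np

def feature_map(x):
    # Illustrative positive feature map (elu(x) + 1); random-feature maps
    # would typically be used in practice.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal (lower-triangular masked) linear attention via prefix sums.

    Q, K: (L, d) queries/keys; V: (L, d_v) values.
    Runs in O(L * d * d_v) time instead of the O(L^2) of explicit masking.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    d, d_v = Qf.shape[1], V.shape[1]
    S = np.zeros((d, d_v))            # running sum of k_t v_t^T for positions <= t
    z = np.zeros(d)                   # running sum of k_t (normaliser)
    out = np.zeros((len(V), d_v))
    for t in range(len(V)):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + eps)
    return out

# Example: L=8 tokens, d=4 query/key dims, d_v=4 value dims.
rng = np.random.default_rng(0)
y = causal_linear_attention(rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 4)))
print(y.shape)  # (8, 4)
```
Unmasked linear attention corresponds to replacing the running sums with totals over all positions; the more general masks discussed in the paper (d-dimensional RPE, graph kernels) modulate this computation through graph structure rather than through the causal chain.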
Related papers
- Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition [0.0]
Existing sequence-to-sequence models for structured language tasks rely heavily on the dot-product self-attention mechanism.
We introduce the Graph Wavelet Transformer (GWT), a novel architecture that replaces this bottleneck with a learnable, multi-scale wavelet transform.
arXiv Detail & Related papers (2025-05-09T00:13:23Z)
- Neural Algorithmic Reasoning for Hypergraphs with Looped Transformers [22.641550077885686]
Looped Transformers have shown exceptional neural algorithmic reasoning capability in traditional graph algorithms.
We extend the Looped Transformer architecture's neural algorithmic reasoning capability to simulate hypergraph algorithms.
arXiv Detail & Related papers (2025-01-18T07:58:45Z)
- Convergence Guarantees for the DeepWalk Embedding on Block Models [9.898607871253775]
We show how to use the DeepWalk algorithm on graphs obtained from the Stochastic Block Model (SBM).
Despite being simplistic, the SBM has proved to be a classic model for analyzing algorithms on large graphs.
arXiv Detail & Related papers (2024-10-26T18:35:11Z)
- Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection [6.759148939470332]
We show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective.
arXiv Detail & Related papers (2024-10-18T16:32:06Z)
- Linear Transformer Topological Masking with Graph Random Features [52.717865653036796]
We show how to parameterise topological masks as a learnable function of a weighted adjacency matrix.
Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data.
arXiv Detail & Related papers (2024-10-04T14:24:06Z)
- Quantum Maximum Entropy Inference and Hamiltonian Learning [4.9614587340495]
This work extends algorithms for maximum entropy inference and learning of graphical models to the quantum realm.
The generalization, known as quantum iterative scaling (QIS), is straightforward, but the key challenge lies in the non-commutative nature of quantum problem instances.
We explore quasi-Newton methods to enhance the performance of QIS and gradient descent (GD).
arXiv Detail & Related papers (2024-07-16T08:11:34Z)
- Discrete Graph Auto-Encoder [52.50288418639075]
We introduce a new framework named Discrete Graph Auto-Encoder (DGAE).
We first use a permutation-equivariant auto-encoder to convert graphs into sets of discrete latent node representations.
In the second step, we sort the sets of discrete latent representations and learn their distribution with a specifically designed auto-regressive model.
arXiv Detail & Related papers (2023-06-13T12:40:39Z)
- Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of a stochastic optimization algorithm can be bounded in terms of the 'complexity' of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z)
- Joint Network Topology Inference via Structured Fusion Regularization [70.30364652829164]
Joint network topology inference represents a canonical problem of learning multiple graph Laplacian matrices from heterogeneous graph signals.
We propose a general graph estimator based on a novel structured fusion regularization.
We show that the proposed graph estimator enjoys both high computational efficiency and rigorous theoretical guarantees.
arXiv Detail & Related papers (2021-03-05T04:42:32Z)
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Scaling Graph Clustering with Distributed Sketches [1.1011268090482575]
We present a method inspired by spectral clustering that instead uses matrix sketches derived from random dimension-reducing projections.
We show that our method produces embeddings that yield performant clustering results given a fully-dynamic block model stream.
We also discuss the effects of block model parameters on the required dimensionality of the subsequent embeddings, and show how random projections could significantly improve the performance of graph clustering in distributed memory (see the illustrative sketch after this list).
arXiv Detail & Related papers (2020-07-24T17:38:04Z)
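As a small illustration of the last entry's idea, and not that paper's distributed algorithm, the NumPy sketch below replaces the eigenvector embedding of spectral clustering with a random dimension-reducing projection propagated over the graph's random-walk matrix; the function name, parameters, and the toy block-model generator are assumptions made for the example.
```python
import numpy as np

def sketch_node_embedding(adj, dim=32, steps=2, seed=0):
    """Node embeddings from a random dimension-reducing projection of the
    random-walk matrix, in the spirit of sketch-based spectral clustering.

    adj: (n, n) symmetric adjacency matrix (dense, for illustration only).
    Returns an (n, dim) embedding that can be fed to any k-means routine.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    deg = np.maximum(adj.sum(axis=1), 1.0)
    P = adj / deg[:, None]                                # row-stochastic random-walk matrix
    emb = rng.standard_normal((n, dim)) / np.sqrt(dim)    # Gaussian sketch
    for _ in range(steps):                                # propagate: P^steps @ sketch
        emb = P @ emb
    return emb

# Toy two-block stochastic block model and its sketch embedding.
rng = np.random.default_rng(1)
n, p_in, p_out = 60, 0.3, 0.02
blocks = np.repeat([0, 1], n // 2)
probs = np.where(blocks[:, None] == blocks[None, :], p_in, p_out)
A = np.triu(rng.random((n, n)) < probs, k=1).astype(float)
A = A + A.T
print(sketch_node_embedding(A).shape)  # (60, 32)
```
The rows of the returned embedding can be handed to any off-the-shelf k-means routine; the cited paper further distributes the sketch computation and analyses how the block-model parameters determine the required embedding dimension.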
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.