Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better
than Dot-Product Self-Attention
- URL: http://arxiv.org/abs/2204.10670v1
- Date: Fri, 22 Apr 2022 12:35:08 GMT
- Title: Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better
than Dot-Product Self-Attention
- Authors: Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang
- Abstract summary: Self-attention is a widely used building block in neural modeling to mix long-range data elements.
We propose a novel scalable and effective mixing building block called Paramixer.
The overall computing cost of the new building block is as low as $O(N \log N)$.
- Score: 9.205331586765613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Attention is a widely used building block in neural modeling to mix
long-range data elements. Most self-attention neural networks employ pairwise
dot-products to specify the attention coefficients. However, these methods
require $O(N^2)$ computing cost for sequence length $N$. Even though some
approximation methods have been introduced to relieve the quadratic cost, the
performance of the dot-product approach is still bottlenecked by the low-rank
constraint in the attention matrix factorization. In this paper, we propose a
novel scalable and effective mixing building block called Paramixer. Our method
factorizes the interaction matrix into several sparse matrices, where we
parameterize the non-zero entries by MLPs with the data elements as input. The
overall computing cost of the new building block is as low as $O(N \log N)$.
Moreover, all factorizing matrices in Paramixer are full-rank, so it does not
suffer from the low-rank bottleneck. We have tested the new method on both
synthetic and various real-world long sequential data sets and compared it with
several state-of-the-art attention networks. The experimental results show that
Paramixer has better performance in most learning tasks.
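For illustration only, the following PyTorch sketch shows the kind of mixing block the abstract describes: the interaction matrix is applied as a product of roughly $\log_2 N$ sparse factors whose non-zero entries are produced by an MLP from the data elements, giving $O(N \log N)$ total cost. The chord-style $2^k$-hop sparsity pattern, the shared two-output MLP, and all names (ParamixerSketch, weight_mlp) are assumptions made for this sketch, not the exact architecture from the paper.

```python
import math
import torch
import torch.nn as nn


class ParamixerSketch(nn.Module):
    """Minimal sketch of a Paramixer-style mixing block (illustrative assumptions).

    In the k-th sparse factor, position i mixes with itself and with the position
    2**k steps away; the two non-zero mixing weights per position are produced by
    a small MLP from the data element at that position, so the total cost is
    O(N log N) in the sequence length N.
    """

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # Weight-generating MLP: maps each element to its two link weights.
        self.weight_mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # [w_self, w_hop] for each position
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim)
        n = x.shape[1]
        num_factors = max(1, math.ceil(math.log2(n)))
        for k in range(num_factors):
            w = self.weight_mlp(x)                              # (batch, N, 2)
            partner = torch.roll(x, shifts=-(2 ** k), dims=1)   # element 2**k hops away
            # One sparse factor: every row of the mixing matrix has two non-zeros.
            x = w[..., 0:1] * x + w[..., 1:2] * partner
        return x


if __name__ == "__main__":
    block = ParamixerSketch(dim=32)
    out = block(torch.randn(4, 128, 32))
    print(out.shape)  # torch.Size([4, 128, 32])
```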
Related papers
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z) - FeDXL: Provable Federated Learning for Deep X-Risk Optimization [105.17383135458897]
We tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing algorithms are applicable.
The challenges for designing an FL algorithm for X-risks lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines.
arXiv Detail & Related papers (2022-10-26T00:23:36Z) - Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Learning Best Combination for Efficient N:M Sparsity [75.34103761423803]
N:M learning can be naturally characterized as a combinatorial search for the best combination within a finite collection.
We show that our learning best combination (LBC) performs consistently better than off-the-shelf N:M sparsity methods across various networks.
arXiv Detail & Related papers (2022-06-14T07:51:31Z) - Computationally Efficient Approximations for Matrix-based Renyi's
Entropy [33.72108955447222]
Recently developed matrix-based Renyi's entropy enables measurement of information in data.
Computation of this quantity involves the trace operator on a PSD matrix $G$ raised to the power $\alpha$ (i.e., $\mathrm{tr}(G^\alpha)$).
We present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$; a sketch of the naive exact computation appears after this list.
arXiv Detail & Related papers (2021-12-27T14:59:52Z) - Joint Majorization-Minimization for Nonnegative Matrix Factorization
with the $\beta$-divergence [4.468952886990851]
This article proposes new multiplicative updates for nonnegative matrix factorization (NMF) with the $\beta$-divergence objective function.
We report experimental results using diverse datasets: face images, an audio spectrogram, hyperspectral data and song play counts.
arXiv Detail & Related papers (2021-06-29T09:58:21Z) - Robust Model Selection and Nearly-Proper Learning for GMMs [26.388358539260473]
In learning theory, a standard assumption is that the data is generated from a finite mixture model. But what happens when the number of components is not known in advance?
We are able to approximately determine the minimum number of components needed to fit the distribution within a logarithmic factor.
arXiv Detail & Related papers (2021-06-05T01:58:40Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $\epsilon$-corrupted $k$-mixture within information-theoretically near-optimal error of $\tilde{O}(\epsilon)$.
Our main technical contribution is a new robust identifiability proof of clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system.
arXiv Detail & Related papers (2020-05-13T16:44:12Z)
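As a point of reference for the matrix-based Rényi's entropy entry above, here is a hedged NumPy sketch of the naive exact evaluation of $\mathrm{tr}(G^\alpha)$ via eigendecomposition, which costs $O(n^3)$ and is the straightforward baseline that efficient approximations aim to improve on. The Gaussian-kernel Gram construction and the function name are illustrative assumptions, not taken from that paper.

```python
import numpy as np


def matrix_renyi_entropy(G: np.ndarray, alpha: float = 2.0) -> float:
    """Naive exact evaluation of the matrix-based Renyi entropy
    S_alpha(G) = 1 / (1 - alpha) * log2(tr(G^alpha)),
    where G is a PSD matrix normalized to unit trace and alpha != 1.
    """
    # Eigendecomposition gives tr(G^alpha) = sum_i lambda_i^alpha.
    eigvals = np.linalg.eigvalsh(G)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return float(np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    # Gaussian-kernel Gram matrix, normalized to unit trace (illustrative choice).
    K = np.exp(-np.square(X[:, None, :] - X[None, :, :]).sum(-1))
    G = K / np.trace(K)
    print(matrix_renyi_entropy(G, alpha=2.0))
```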