Sampled Transformer for Point Sets
- URL: http://arxiv.org/abs/2302.14346v1
- Date: Tue, 28 Feb 2023 06:38:05 GMT
- Title: Sampled Transformer for Point Sets
- Authors: Shidi Li, Christian Walder, Alexander Soen, Lexing Xie, Miaomiao Liu
- Abstract summary: The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
- Score: 80.66097006145999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sparse transformer can reduce the computational complexity of the
self-attention layers to $O(n)$, whilst still being a universal approximator of
continuous sequence-to-sequence functions. However, this permutation-variant
operation is not appropriate for direct application to sets. In this paper, we
propose an $O(n)$ complexity sampled transformer that can process point set
elements directly without any additional inductive bias. Our sampled
transformer introduces random element sampling, which randomly splits point
sets into subsets, followed by applying a shared Hamiltonian self-attention
mechanism to each subset. The overall attention mechanism can be viewed as a
Hamiltonian cycle in the complete attention graph, and the permutation of point
set elements is equivalent to randomly sampling Hamiltonian cycles. This
mechanism implements a Monte Carlo simulation of the $O(n^2)$ dense attention
connections. We show that it is a universal approximator for continuous
set-to-set functions. Experimental results on point-clouds show comparable or
better accuracy with significantly reduced computational complexity compared to
the dense transformer or alternative sparse attention schemes.
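As a rough illustration of the mechanism described in the abstract, the sketch below (hypothetical names, not the authors' released code) randomly permutes the point set, splits it into fixed-size subsets, and applies one shared self-attention module within each subset; each forward pass thus realizes one Monte Carlo sample of the dense $O(n^2)$ attention graph at $O(n)$ cost.
```python
# A minimal sketch, assuming PyTorch; names such as SampledSetAttention and
# subset_size are illustrative and not taken from the paper.
import torch
import torch.nn as nn

class SampledSetAttention(nn.Module):
    def __init__(self, dim=64, heads=4, subset_size=32):
        super().__init__()
        self.subset_size = subset_size
        # One attention module shared by all subsets.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (n, dim) point-set features
        n, d = x.shape
        perm = torch.randperm(n)                # random permutation ~ one sampled Hamiltonian cycle
        pad = (-n) % self.subset_size           # pad so n divides evenly into subsets
        xp = torch.cat([x[perm], x[perm[:pad]]]) if pad else x[perm]
        groups = xp.view(-1, self.subset_size, d)       # (n/m, m, dim) random subsets
        out, _ = self.attn(groups, groups, groups)      # shared self-attention per subset
        out = out.reshape(-1, d)[:n]
        inv = torch.empty_like(perm)
        inv[perm] = torch.arange(n)             # invert the permutation
        return out[inv]                         # restore the original element order
```
Stacking several such layers, each drawing a fresh permutation, would let different random subsets exchange information across depth, in line with the Monte Carlo view of the dense attention connections described above.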
Related papers
- Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism.
We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
- Universal algorithm for transforming Hamiltonian eigenvalues [0.7499722271664144]
We provide a new way of manipulating Hamiltonians, by transforming their eigenvalues while keeping their eigenstates unchanged.
We develop a universal algorithm that deterministically implements any desired function on the eigenvalues of any unknown Hamiltonian.
We present a universal algorithm to transform positive-time to negative-time dynamics without adding an auxiliary qubit.
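For intuition only, a small classical analogue (not the paper's quantum algorithm) of transforming eigenvalues while keeping eigenstates unchanged is to diagonalize a Hermitian matrix, apply the desired function to the eigenvalues, and rebuild the matrix from the same eigenvectors:
```python
# A classical numpy illustration of eigenvalue transformation; the quantum setting
# in the paper operates on an unknown Hamiltonian, which this sketch does not model.
import numpy as np

def transform_eigenvalues(H, f):
    """Return V f(Lambda) V^dagger for a Hermitian H = V Lambda V^dagger."""
    eigvals, eigvecs = np.linalg.eigh(H)
    return eigvecs @ np.diag(f(eigvals)) @ eigvecs.conj().T

# Example: mapping every eigenvalue to its negative (H -> -H) corresponds to
# swapping positive-time and negative-time dynamics, with eigenstates untouched.
H = np.array([[1.0, 0.5j], [-0.5j, 2.0]])
H_neg = transform_eigenvalues(H, lambda lam: -lam)
```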
arXiv Detail & Related papers (2023-12-14T12:06:12Z)
- Improving Gradient-guided Nested Sampling for Posterior Inference [47.08481529384556]
We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$.
We show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution.
arXiv Detail & Related papers (2023-12-06T21:09:18Z)
- Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention [27.850970793739933]
We show that replacing self-attention in Transformer with multi-head neural $n$-gram can achieve comparable or better performance than Transformer.
From various analyses on our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention.
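The exact layer design is not spelled out above, so the following is only an assumed reading, not the paper's architecture: a multi-head neural $n$-gram block can be sketched as a causal mixing of each token with its $n-1$ predecessors using learned per-head weights in place of self-attention.
```python
# A heavily hedged PyTorch sketch; MultiHeadNeuralNGram and its parameters are
# illustrative guesses, not the layer proposed in the paper above.
import torch
import torch.nn as nn

class MultiHeadNeuralNGram(nn.Module):
    def __init__(self, dim=64, heads=4, n=3):
        super().__init__()
        self.n, self.heads, self.head_dim = n, heads, dim // heads
        self.proj_in = nn.Linear(dim, dim)
        self.mix = nn.Parameter(torch.randn(heads, n))   # per-head weights over the n-gram window
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (batch, seq, dim)
        b, t, d = x.shape
        h = self.proj_in(x).view(b, t, self.heads, self.head_dim)
        h = nn.functional.pad(h, (0, 0, 0, 0, self.n - 1, 0))   # left-pad the time axis
        # Gather, for every position, its n-token causal window: (b, t, n, heads, head_dim).
        window = torch.stack([h[:, i:i + t] for i in range(self.n)], dim=2)
        w = torch.softmax(self.mix, dim=-1)              # normalized per-head mixing weights
        out = torch.einsum('btnhc,hn->bthc', window, w).reshape(b, t, d)
        return self.proj_out(out)
```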
arXiv Detail & Related papers (2022-07-27T08:20:00Z)
- Stochastic Zeroth order Descent with Structured Directions [10.604744518360464]
We introduce and analyze Structured Stochastic Zeroth order Descent (SSZD), a finite-difference approach that approximates the gradient on a set of $l \leq d$ directions, where $d$ is the dimension of the ambient space.
For convex functions we prove almost sure convergence and a convergence rate of $O((d/l)k^{-c})$ for every $c < 1/2$, which is arbitrarily close to that of Stochastic Gradient Descent (SGD) in terms of the number of iterations.
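A hedged sketch of such a structured finite-difference estimator follows (illustrative names and constants, not the paper's code): the gradient is approximated along $l \leq d$ random orthonormal directions and a single descent step is taken.
```python
# A minimal numpy sketch of zeroth-order descent with structured directions;
# step sizes and the QR-based direction sampling are assumptions for illustration.
import numpy as np

def structured_zo_step(f, x, l, h=1e-5, lr=1e-2, rng=np.random.default_rng()):
    d = x.size
    Q, _ = np.linalg.qr(rng.standard_normal((d, l)))     # l random orthonormal directions
    grad_est = np.zeros(d)
    for i in range(l):
        p = Q[:, i]
        grad_est += (f(x + h * p) - f(x)) / h * p        # forward difference along p
    grad_est *= d / l    # rescale: in expectation over directions this matches the gradient as h -> 0
    return x - lr * grad_est                             # one descent step

# Usage: minimize a quadratic in R^10 with l = 3 directions per iteration.
f = lambda z: 0.5 * np.sum(z ** 2)
x = np.ones(10)
for _ in range(200):
    x = structured_zo_step(f, x, l=3)
```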
arXiv Detail & Related papers (2022-06-10T14:00:06Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Coherent randomized benchmarking [68.8204255655161]
We show that superpositions of different random sequences, rather than independent samples, can be used.
We show that this leads to a uniform and simple protocol with significant advantages with respect to the gates that can be benchmarked.
arXiv Detail & Related papers (2020-10-26T18:00:34Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections (a sliding-window sketch of one such pattern follows this entry).
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
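To make the $O(n)$-connection idea from the last entry concrete, the sketch below builds one common sparsity pattern of this kind, a causal sliding window in which every token attends to a constant number of neighbours; it illustrates such patterns in general, not the paper's specific constructions.
```python
# A minimal sketch of an O(n)-connection attention mask: with a fixed window size w,
# each of the n tokens attends to at most w positions, giving O(n * w) = O(n) edges.
import torch

def sliding_window_mask(n, w):
    """Boolean (n, n) mask; True marks an allowed attention connection."""
    i = torch.arange(n).unsqueeze(1)          # query positions
    j = torch.arange(n).unsqueeze(0)          # key positions
    return (j <= i) & (j > i - w)             # causal window of width w

mask = sliding_window_mask(n=8, w=3)
print(int(mask.sum()))                        # connection count grows linearly with n
```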