Sampled Transformer for Point Sets
- URL: http://arxiv.org/abs/2302.14346v1
- Date: Tue, 28 Feb 2023 06:38:05 GMT
- Title: Sampled Transformer for Point Sets
- Authors: Shidi Li, Christian Walder, Alexander Soen, Lexing Xie, Miaomiao Liu
- Abstract summary: The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
- Score: 80.66097006145999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sparse transformer can reduce the computational complexity of the
self-attention layers to $O(n)$, whilst still being a universal approximator of
continuous sequence-to-sequence functions. However, this permutation-variant
operation is not appropriate for direct application to sets. In this paper, we
propose an $O(n)$ complexity sampled transformer that can process point set
elements directly without any additional inductive bias. Our sampled
transformer introduces random element sampling, which randomly splits point
sets into subsets, followed by applying a shared Hamiltonian self-attention
mechanism to each subset. The overall attention mechanism can be viewed as a
Hamiltonian cycle in the complete attention graph, and the permutation of point
set elements is equivalent to randomly sampling Hamiltonian cycles. This
mechanism implements a Monte Carlo simulation of the $O(n^2)$ dense attention
connections. We show that it is a universal approximator for continuous
set-to-set functions. Experimental results on point-clouds show comparable or
better accuracy with significantly reduced computational complexity compared to
the dense transformer or alternative sparse attention schemes.
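As a rough illustration of the mechanism described in the abstract, the sketch below (hypothetical names, not the authors' released code) randomly permutes the point set, splits it into fixed-size subsets, and applies one shared self-attention module within each subset; each forward pass thus realizes one Monte Carlo sample of the dense $O(n^2)$ attention graph at $O(n)$ cost.
```python
# A minimal sketch, assuming PyTorch; names such as SampledSetAttention and
# subset_size are illustrative and not taken from the paper.
import torch
import torch.nn as nn

class SampledSetAttention(nn.Module):
    def __init__(self, dim=64, heads=4, subset_size=32):
        super().__init__()
        self.subset_size = subset_size
        # One attention module shared by all subsets.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (n, dim) point-set features
        n, d = x.shape
        perm = torch.randperm(n)                # random permutation ~ one sampled Hamiltonian cycle
        pad = (-n) % self.subset_size           # pad so n divides evenly into subsets
        xp = torch.cat([x[perm], x[perm[:pad]]]) if pad else x[perm]
        groups = xp.view(-1, self.subset_size, d)       # (n/m, m, dim) random subsets
        out, _ = self.attn(groups, groups, groups)      # shared self-attention per subset
        out = out.reshape(-1, d)[:n]
        inv = torch.empty_like(perm)
        inv[perm] = torch.arange(n)             # invert the permutation
        return out[inv]                         # restore the original element order
```
Stacking several such layers, each drawing a fresh permutation, would let different random subsets exchange information across depth, in line with the Monte Carlo view of the dense attention connections described above.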
Related papers
- Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism.
We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
- Universal algorithm for transforming Hamiltonian eigenvalues [0.7499722271664144]
We provide a new way of manipulating Hamiltonians, by transforming their eigenvalues while keeping their eigenstates unchanged.
We develop a universal algorithm that deterministically implements any desired function on the eigenvalues of any unknown Hamiltonian.
We present a universal algorithm to transform positive-time to negative-time dynamics without adding an auxiliary qubit.
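For intuition only, a small classical analogue (not the paper's quantum algorithm) of transforming eigenvalues while keeping eigenstates unchanged is to diagonalize a Hermitian matrix, apply the desired function to the eigenvalues, and rebuild the matrix from the same eigenvectors:
```python
# A classical numpy illustration of eigenvalue transformation; the quantum setting
# in the paper operates on an unknown Hamiltonian, which this sketch does not model.
import numpy as np

def transform_eigenvalues(H, f):
    """Return V f(Lambda) V^dagger for a Hermitian H = V Lambda V^dagger."""
    eigvals, eigvecs = np.linalg.eigh(H)
    return eigvecs @ np.diag(f(eigvals)) @ eigvecs.conj().T

# Example: mapping every eigenvalue to its negative (H -> -H) corresponds to
# swapping positive-time and negative-time dynamics, with eigenstates untouched.
H = np.array([[1.0, 0.5j], [-0.5j, 2.0]])
H_neg = transform_eigenvalues(H, lambda lam: -lam)
```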
arXiv Detail & Related papers (2023-12-14T12:06:12Z)
- Improving Gradient-guided Nested Sampling for Posterior Inference [47.08481529384556]
We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$.
We show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution.
arXiv Detail & Related papers (2023-12-06T21:09:18Z)
- Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention [27.850970793739933]
We show that replacing self-attention in Transformer with multi-head neural $n$-gram can achieve comparable or better performance than Transformer.
From various analyses on our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention.
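The exact layer design is not spelled out above, so the following is only an assumed reading, not the paper's architecture: a multi-head neural $n$-gram block can be sketched as a causal mixing of each token with its $n-1$ predecessors using learned per-head weights in place of self-attention.
```python
# A heavily hedged PyTorch sketch; MultiHeadNeuralNGram and its parameters are
# illustrative guesses, not the layer proposed in the paper above.
import torch
import torch.nn as nn

class MultiHeadNeuralNGram(nn.Module):
    def __init__(self, dim=64, heads=4, n=3):
        super().__init__()
        self.n, self.heads, self.head_dim = n, heads, dim // heads
        self.proj_in = nn.Linear(dim, dim)
        self.mix = nn.Parameter(torch.randn(heads, n))   # per-head weights over the n-gram window
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (batch, seq, dim)
        b, t, d = x.shape
        h = self.proj_in(x).view(b, t, self.heads, self.head_dim)
        h = nn.functional.pad(h, (0, 0, 0, 0, self.n - 1, 0))   # left-pad the time axis
        # Gather, for every position, its n-token causal window: (b, t, n, heads, head_dim).
        window = torch.stack([h[:, i:i + t] for i in range(self.n)], dim=2)
        w = torch.softmax(self.mix, dim=-1)              # normalized per-head mixing weights
        out = torch.einsum('btnhc,hn->bthc', window, w).reshape(b, t, d)
        return self.proj_out(out)
```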
arXiv Detail & Related papers (2022-07-27T08:20:00Z)
- Stochastic Zeroth order Descent with Structured Directions [10.604744518360464]
We introduce and analyze Structured Stochastic Zeroth order Descent (SSZD), a finite-difference approach that approximates the gradient on a set of $l \leq d$ directions, where $d$ is the dimension of the ambient space.
For convex functions we prove almost sure convergence and a convergence rate of $O((d/l)k^{-c})$ for every $c < 1/2$, which is arbitrarily close to that of Stochastic Gradient Descent (SGD) in terms of the number of iterations.
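A hedged sketch of such a structured finite-difference estimator follows (illustrative names and constants, not the paper's code): the gradient is approximated along $l \leq d$ random orthonormal directions and a single descent step is taken.
```python
# A minimal numpy sketch of zeroth-order descent with structured directions;
# step sizes and the QR-based direction sampling are assumptions for illustration.
import numpy as np

def structured_zo_step(f, x, l, h=1e-5, lr=1e-2, rng=np.random.default_rng()):
    d = x.size
    Q, _ = np.linalg.qr(rng.standard_normal((d, l)))     # l random orthonormal directions
    grad_est = np.zeros(d)
    for i in range(l):
        p = Q[:, i]
        grad_est += (f(x + h * p) - f(x)) / h * p        # forward difference along p
    grad_est *= d / l    # rescale: in expectation over directions this matches the gradient as h -> 0
    return x - lr * grad_est                             # one descent step

# Usage: minimize a quadratic in R^10 with l = 3 directions per iteration.
f = lambda z: 0.5 * np.sum(z ** 2)
x = np.ones(10)
for _ in range(200):
    x = structured_zo_step(f, x, l=3)
```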
arXiv Detail & Related papers (2022-06-10T14:00:06Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Coherent randomized benchmarking [68.8204255655161]
We show that superpositions of different random sequences, rather than independent samples, can be used.
We show that this leads to a uniform and simple protocol with significant advantages with respect to the gates that can be benchmarked.
arXiv Detail & Related papers (2020-10-26T18:00:34Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections (a sliding-window sketch of one such pattern follows this entry).
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
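To make the $O(n)$-connection idea from the last entry concrete, the sketch below builds one common sparsity pattern of this kind, a causal sliding window in which every token attends to a constant number of neighbours; it illustrates such patterns in general, not the paper's specific constructions.
```python
# A minimal sketch of an O(n)-connection attention mask: with a fixed window size w,
# each of the n tokens attends to at most w positions, giving O(n * w) = O(n) edges.
import torch

def sliding_window_mask(n, w):
    """Boolean (n, n) mask; True marks an allowed attention connection."""
    i = torch.arange(n).unsqueeze(1)          # query positions
    j = torch.arange(n).unsqueeze(0)          # key positions
    return (j <= i) & (j > i - w)             # causal window of width w

mask = sliding_window_mask(n=8, w=3)
print(int(mask.sum()))                        # connection count grows linearly with n
```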