Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
- URL: http://arxiv.org/abs/2106.02795v1
- Date: Sat, 5 Jun 2021 04:40:18 GMT
- Title: Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
- Authors: Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, Samy Bengio
- Abstract summary: We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
- Score: 96.9752763607738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attentional mechanisms are order-invariant. Positional encoding is a crucial
component to allow attention-based deep model architectures such as Transformer
to address sequences or images where the position of information matters. In
this paper, we propose a novel positional encoding method based on learnable
Fourier features. Instead of hard-coding each position as a token or a vector,
we represent each position, which can be multi-dimensional, as a trainable
encoding based on learnable Fourier feature mapping, modulated with a
multi-layer perceptron. The representation is particularly advantageous for a
spatial multi-dimensional position, e.g., pixel positions on an image, where
$L_2$ distances or more complex positional relationships need to be captured.
Our experiments based on several public benchmark tasks show that our learnable
Fourier feature representation for multi-dimensional positional encoding
outperforms existing methods by both improving the accuracy and allowing faster
convergence.
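To make the method concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a trainable Fourier feature mapping of a multi-dimensional position, modulated by a small MLP. The layer sizes, initialization scale, and normalization here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LearnableFourierPE(nn.Module):
    """Sketch of a learnable Fourier-feature positional encoding for
    multi-dimensional positions (e.g. 2-D pixel coordinates)."""

    def __init__(self, pos_dim=2, fourier_dim=64, hidden_dim=32, out_dim=128):
        super().__init__()
        # Trainable frequency matrix: each row is a learned frequency
        # vector applied to the pos_dim-dimensional position (assumed init scale).
        self.W_r = nn.Parameter(torch.randn(fourier_dim // 2, pos_dim) * 0.02)
        # Small MLP that modulates the Fourier features into the final encoding.
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        self.fourier_dim = fourier_dim

    def forward(self, pos):
        # pos: (..., pos_dim) continuous positions.
        proj = pos @ self.W_r.t()                      # (..., fourier_dim / 2)
        feats = torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
        feats = feats / self.fourier_dim ** 0.5        # scaling is an assumption
        return self.mlp(feats)                         # (..., out_dim)


# Example: encode a grid of 2-D pixel positions for an 8x8 feature map.
ys, xs = torch.meshgrid(torch.arange(8.0), torch.arange(8.0), indexing="ij")
positions = torch.stack([ys, xs], dim=-1).reshape(-1, 2)    # (64, 2)
pe = LearnableFourierPE()(positions)                         # (64, 128)
```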
Related papers
- Improving Transformers using Faithful Positional Encoding [55.30212768657544]
We propose a new positional encoding method for a neural network architecture called the Transformer.
Unlike the standard sinusoidal positional encoding, our approach has a guarantee of not losing information about the positional order of the input sequence.
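For reference, the standard sinusoidal positional encoding contrasted above (Vaswani et al., 2017) can be sketched as follows:

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)     # (d_model / 2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe


pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)  # (128, 512)
```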
arXiv Detail & Related papers (2024-05-15T03:17:30Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction [28.910183274743872]
We introduce neural implicit representations with quantized coordinates, which reduces the uncertainty and ambiguity in the field during optimization.
We use discrete coordinates and their positional encodings to learn implicit functions through volume rendering.
Our evaluations on widely used benchmarks show superiority over the state-of-the-art.
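As a rough illustration of coordinate quantization (not this paper's exact formulation), continuous coordinates can be snapped to a discrete grid before their positional encodings are computed; the grid resolution and value range below are assumptions:

```python
import numpy as np


def quantize_coords(coords, n_levels=256, lo=-1.0, hi=1.0):
    """Snap continuous coordinates in [lo, hi] onto a grid with n_levels
    cells per axis; return grid indices and the de-quantized coordinates
    that would feed a positional encoding."""
    coords = np.clip(coords, lo, hi)
    idx = np.round((coords - lo) / (hi - lo) * (n_levels - 1)).astype(int)
    dequantized = lo + idx / (n_levels - 1) * (hi - lo)
    return idx, dequantized


# Example: 3-D sample points along a camera ray, quantized before encoding.
points = np.random.uniform(-1.0, 1.0, size=(1024, 3))
idx, snapped = quantize_coords(points)
```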
arXiv Detail & Related papers (2023-08-21T20:27:33Z)
- Trading Positional Complexity vs. Deepness in Coordinate Networks [33.90893096003318]
We show that alternative non-Fourier embedding functions can indeed be used for positional encoding.
Their performance is entirely determined by a trade-off between the stable rank of the embedded matrix and the distance preservation between embedded coordinates.
We argue that employing a more complex positional encoding -- that scales exponentially with the number of modes -- requires only a linear (rather than deep) coordinate function to achieve comparable performance.
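To make the two quantities in this trade-off concrete, the sketch below computes the stable rank of an embedding matrix (squared Frobenius norm over squared spectral norm) and a simple measure of pairwise distance distortion; the random Fourier embedding and the distortion metric are illustrative stand-ins, not the paper's exact definitions:

```python
import numpy as np


def stable_rank(A):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2


def distance_distortion(X, E):
    """Mean relative change in pairwise distances after embedding X -> E."""
    dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    de = np.linalg.norm(E[:, None] - E[None, :], axis=-1)
    mask = dx > 0
    return np.mean(np.abs(de[mask] - dx[mask]) / dx[mask])


# Illustrative random Fourier embedding of 1-D coordinates.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]
B = rng.normal(scale=10.0, size=(1, 64))                  # frequency matrix
E = np.concatenate([np.cos(x @ B), np.sin(x @ B)], axis=1)

print(stable_rank(E), distance_distortion(x, E))
```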
arXiv Detail & Related papers (2022-05-18T15:17:09Z)
- PINs: Progressive Implicit Networks for Multi-Scale Neural Representations [68.73195473089324]
We propose a progressive positional encoding, exposing a hierarchical structure to incremental sets of frequency encodings.
Our model accurately reconstructs scenes with wide frequency bands and learns a scene representation at progressive levels of detail.
Experiments on several 2D and 3D datasets show improvements in reconstruction accuracy, representational capacity and training speed compared to baselines.
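A toy sketch of a progressive (coarse-to-fine) frequency schedule, in which low-frequency bands are exposed first and higher bands are gradually unmasked during training; the specific windowing here is an assumption for illustration, not the paper's exact scheme:

```python
import numpy as np


def progressive_encoding(x, n_bands=8, progress=0.5):
    """Fourier features of x with a soft mask that exposes frequency bands
    incrementally: `progress` in [0, 1] controls how many bands are active."""
    bands = 2.0 ** np.arange(n_bands)                      # octave frequencies
    # Weight per band: 1 for unlocked bands, 0 for bands not yet used,
    # a linear ramp for the band currently being introduced.
    alpha = progress * n_bands
    weights = np.clip(alpha - np.arange(n_bands), 0.0, 1.0)
    angles = x[..., None] * bands * np.pi                  # (..., n_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats * np.concatenate([weights, weights], axis=-1)


# Early in training only coarse bands contribute; later all bands are active.
coarse = progressive_encoding(np.linspace(0, 1, 5), progress=0.2)
fine = progressive_encoding(np.linspace(0, 1, 5), progress=1.0)
```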
arXiv Detail & Related papers (2022-02-09T20:33:37Z)
- Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [8.944233327731245]
This paper proposes an improved Geometry Attention Transformer (GAT) model.
In order to further leverage geometric information, two novel geometry-aware architectures are designed.
Our GAT could often outperform current state-of-the-art image captioning models.
arXiv Detail & Related papers (2021-10-01T11:57:50Z)
- Rethinking Positional Encoding [31.80055086317266]
We show that alternative non-Fourier embedding functions can indeed be used for positional encoding.
We show that their performance is entirely determined by a trade-off between the stable rank of the embedded matrix and the distance preservation between embedded coordinates.
We present a more general theory to analyze positional encoding in terms of shifted basis functions.
arXiv Detail & Related papers (2021-07-06T12:04:04Z)
- LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [52.63874513999119]
Cross-resolution image alignment is a key problem in multiscale giga photography.
Existing deep homography methods neglect the explicit formulation of correspondences between the inputs, which leads to degraded accuracy in cross-resolution challenges.
We propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs.
arXiv Detail & Related papers (2021-06-08T02:51:45Z)
- Modulated Periodic Activations for Generalizable Local Functional Representations [113.64179351957888]
We present a new representation that generalizes to multiple instances and achieves state-of-the-art fidelity.
Our approach produces general functional representations of images, videos and shapes, and achieves higher reconstruction quality than prior works that are optimized for a single signal.
arXiv Detail & Related papers (2021-04-08T17:59:04Z)
- Attention-Based Multimodal Image Matching [16.335191345543063]
We propose an attention-based approach for multimodal image patch matching using a Transformer encoder.
Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues.
This is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
arXiv Detail & Related papers (2021-03-20T21:14:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.