Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
- URL: http://arxiv.org/abs/2106.02795v1
- Date: Sat, 5 Jun 2021 04:40:18 GMT
- Title: Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
- Authors: Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, Samy Bengio
- Abstract summary: We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
- Score: 96.9752763607738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attentional mechanisms are order-invariant. Positional encoding is a crucial
component to allow attention-based deep model architectures such as Transformer
to address sequences or images where the position of information matters. In
this paper, we propose a novel positional encoding method based on learnable
Fourier features. Instead of hard-coding each position as a token or a vector,
we represent each position, which can be multi-dimensional, as a trainable
encoding based on learnable Fourier feature mapping, modulated with a
multi-layer perceptron. The representation is particularly advantageous for a
spatial multi-dimensional position, e.g., pixel positions on an image, where
$L_2$ distances or more complex positional relationships need to be captured.
Our experiments based on several public benchmark tasks show that our learnable
Fourier feature representation for multi-dimensional positional encoding
outperforms existing methods by both improving the accuracy and allowing faster
convergence.
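To make the method concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a trainable Fourier feature mapping of a multi-dimensional position, modulated by a small MLP. The layer sizes, initialization scale, and normalization here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LearnableFourierPE(nn.Module):
    """Sketch of a learnable Fourier-feature positional encoding for
    multi-dimensional positions (e.g. 2-D pixel coordinates)."""

    def __init__(self, pos_dim=2, fourier_dim=64, hidden_dim=32, out_dim=128):
        super().__init__()
        # Trainable frequency matrix: each row is a learned frequency
        # vector applied to the pos_dim-dimensional position (assumed init scale).
        self.W_r = nn.Parameter(torch.randn(fourier_dim // 2, pos_dim) * 0.02)
        # Small MLP that modulates the Fourier features into the final encoding.
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        self.fourier_dim = fourier_dim

    def forward(self, pos):
        # pos: (..., pos_dim) continuous positions.
        proj = pos @ self.W_r.t()                      # (..., fourier_dim / 2)
        feats = torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
        feats = feats / self.fourier_dim ** 0.5        # scaling is an assumption
        return self.mlp(feats)                         # (..., out_dim)


# Example: encode a grid of 2-D pixel positions for an 8x8 feature map.
ys, xs = torch.meshgrid(torch.arange(8.0), torch.arange(8.0), indexing="ij")
positions = torch.stack([ys, xs], dim=-1).reshape(-1, 2)    # (64, 2)
pe = LearnableFourierPE()(positions)                         # (64, 128)
```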
Related papers
- Improving Transformers using Faithful Positional Encoding [55.30212768657544]
We propose a new positional encoding method for a neural network architecture called the Transformer.
Unlike the standard sinusoidal positional encoding, our approach has a guarantee of not losing information about the positional order of the input sequence.
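For reference, the standard sinusoidal positional encoding contrasted above (Vaswani et al., 2017) can be sketched as follows:

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)     # (d_model / 2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe


pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)  # (128, 512)
```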
arXiv Detail & Related papers (2024-05-15T03:17:30Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction [28.910183274743872]
We introduce neural implicit representations with quantized coordinates, which reduces the uncertainty and ambiguity in the field during optimization.
We use discrete coordinates and their positional encodings to learn implicit functions through volume rendering.
Our evaluations on widely used benchmarks show superiority over the state-of-the-art.
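As a rough illustration of coordinate quantization (not this paper's exact formulation), continuous coordinates can be snapped to a discrete grid before their positional encodings are computed; the grid resolution and value range below are assumptions:

```python
import numpy as np


def quantize_coords(coords, n_levels=256, lo=-1.0, hi=1.0):
    """Snap continuous coordinates in [lo, hi] onto a grid with n_levels
    cells per axis; return grid indices and the de-quantized coordinates
    that would feed a positional encoding."""
    coords = np.clip(coords, lo, hi)
    idx = np.round((coords - lo) / (hi - lo) * (n_levels - 1)).astype(int)
    dequantized = lo + idx / (n_levels - 1) * (hi - lo)
    return idx, dequantized


# Example: 3-D sample points along a camera ray, quantized before encoding.
points = np.random.uniform(-1.0, 1.0, size=(1024, 3))
idx, snapped = quantize_coords(points)
```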
arXiv Detail & Related papers (2023-08-21T20:27:33Z)
- Trading Positional Complexity vs. Deepness in Coordinate Networks [33.90893096003318]
We show that alternative non-Fourier embedding functions can indeed be used for positional encoding.
Their performance is entirely determined by a trade-off between the stable rank of the embedded matrix and the distance preservation between embedded coordinates.
We argue that employing a more complex positional encoding -- that scales exponentially with the number of modes -- requires only a linear (rather than deep) coordinate function to achieve comparable performance.
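To make the two quantities in this trade-off concrete, the sketch below computes the stable rank of an embedding matrix (squared Frobenius norm over squared spectral norm) and a simple measure of pairwise distance distortion; the random Fourier embedding and the distortion metric are illustrative stand-ins, not the paper's exact definitions:

```python
import numpy as np


def stable_rank(A):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2


def distance_distortion(X, E):
    """Mean relative change in pairwise distances after embedding X -> E."""
    dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    de = np.linalg.norm(E[:, None] - E[None, :], axis=-1)
    mask = dx > 0
    return np.mean(np.abs(de[mask] - dx[mask]) / dx[mask])


# Illustrative random Fourier embedding of 1-D coordinates.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]
B = rng.normal(scale=10.0, size=(1, 64))                  # frequency matrix
E = np.concatenate([np.cos(x @ B), np.sin(x @ B)], axis=1)

print(stable_rank(E), distance_distortion(x, E))
```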
arXiv Detail & Related papers (2022-05-18T15:17:09Z)
- PINs: Progressive Implicit Networks for Multi-Scale Neural Representations [68.73195473089324]
We propose a progressive positional encoding, exposing a hierarchical structure to incremental sets of frequency encodings.
Our model accurately reconstructs scenes with wide frequency bands and learns a scene representation at progressive levels of detail.
Experiments on several 2D and 3D datasets show improvements in reconstruction accuracy, representational capacity and training speed compared to baselines.
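A toy sketch of a progressive (coarse-to-fine) frequency schedule, in which low-frequency bands are exposed first and higher bands are gradually unmasked during training; the specific windowing here is an assumption for illustration, not the paper's exact scheme:

```python
import numpy as np


def progressive_encoding(x, n_bands=8, progress=0.5):
    """Fourier features of x with a soft mask that exposes frequency bands
    incrementally: `progress` in [0, 1] controls how many bands are active."""
    bands = 2.0 ** np.arange(n_bands)                      # octave frequencies
    # Weight per band: 1 for unlocked bands, 0 for bands not yet used,
    # a linear ramp for the band currently being introduced.
    alpha = progress * n_bands
    weights = np.clip(alpha - np.arange(n_bands), 0.0, 1.0)
    angles = x[..., None] * bands * np.pi                  # (..., n_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats * np.concatenate([weights, weights], axis=-1)


# Early in training only coarse bands contribute; later all bands are active.
coarse = progressive_encoding(np.linspace(0, 1, 5), progress=0.2)
fine = progressive_encoding(np.linspace(0, 1, 5), progress=1.0)
```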
arXiv Detail & Related papers (2022-02-09T20:33:37Z)
- Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [8.944233327731245]
This paper proposes an improved Geometry Attention Transformer (GAT) model.
In order to further leverage geometric information, two novel geometry-aware architectures are designed.
Our GAT could often outperform current state-of-the-art image captioning models.
arXiv Detail & Related papers (2021-10-01T11:57:50Z)
- Rethinking Positional Encoding [31.80055086317266]
We show that alternative non-Fourier embedding functions can indeed be used for positional encoding.
We show that their performance is entirely determined by a trade-off between the stable rank of the embedded matrix and the distance preservation between embedded coordinates.
We present a more general theory to analyze positional encoding in terms of shifted basis functions.
arXiv Detail & Related papers (2021-07-06T12:04:04Z)
- LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [52.63874513999119]
Cross-resolution image alignment is a key problem in multiscale giga photography.
Existing deep homography methods neglect the explicit formulation of correspondences between the inputs, which leads to degraded accuracy in cross-resolution challenges.
We propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs.
arXiv Detail & Related papers (2021-06-08T02:51:45Z)
- Modulated Periodic Activations for Generalizable Local Functional Representations [113.64179351957888]
We present a new representation that generalizes to multiple instances and achieves state-of-the-art fidelity.
Our approach produces general functional representations of images, videos and shapes, and achieves higher reconstruction quality than prior works that are optimized for a single signal.
arXiv Detail & Related papers (2021-04-08T17:59:04Z)
- Attention-Based Multimodal Image Matching [16.335191345543063]
We propose an attention-based approach for multimodal image patch matching using a Transformer encoder.
Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues.
This is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
arXiv Detail & Related papers (2021-03-20T21:14:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.