Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions
- URL: http://arxiv.org/abs/2508.19167v1
- Date: Tue, 26 Aug 2025 16:14:59 GMT
- Title: Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions
- Authors: Zhihang Xin, Xitong Hu, Rui Wang
- Abstract summary: Vision Transformers have demonstrated remarkable success in computer vision tasks. Traditional positional encoding approaches fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances. We propose WEF-PE, a mathematically principled approach that directly encodes two-dimensional coordinates through a natural complex-domain representation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex-domain representation, where the doubly periodic properties of elliptic functions align remarkably with the translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with the ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.
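The abstract's core ideas can be illustrated concretely: patch coordinates become complex numbers z, the doubly periodic Weierstrass ℘ function (and its derivative ℘') encodes them, and the addition formula ℘(z1+z2) = ¼((℘'(z1)−℘'(z2))/(℘(z1)−℘(z2)))² − ℘(z1) − ℘(z2) relates absolute encodings to relative ones. The sketch below is NOT the authors' released implementation: the function names `weierstrass_p` and `wef_pe`, the square lattice with periods 1 and i, the truncation depth `N`, and the 4-channel feature layout are all illustrative assumptions; ℘ is approximated by a truncated lattice sum.

```python
import numpy as np

def weierstrass_p(z, N=15):
    """Truncated lattice sums for the Weierstrass p-function and its
    derivative on the square lattice with periods 1 and i:

        p(z)  = 1/z^2 + sum' [ 1/(z-w)^2 - 1/w^2 ]
        p'(z) = -2/z^3 - 2 * sum' 1/(z-w)^3

    where sum' runs over nonzero lattice points w = m + n*i, |m|,|n| <= N.
    The truncation is symmetric, so evenness of p and oddness of p' hold
    exactly; double periodicity holds only approximately.
    """
    z = np.asarray(z, dtype=complex)
    m, n = np.meshgrid(np.arange(-N, N + 1), np.arange(-N, N + 1))
    omega = (m + 1j * n).ravel()
    omega = omega[omega != 0]          # exclude the pole at the origin
    zz = z[..., None]                  # broadcast z against all lattice points
    p = 1.0 / z**2 + np.sum(1.0 / (zz - omega)**2 - 1.0 / omega**2, axis=-1)
    dp = -2.0 / z**3 - 2.0 * np.sum(1.0 / (zz - omega)**3, axis=-1)
    return p, dp

def wef_pe(h, w, N=15):
    """Hypothetical 4-channel positional encoding for an h x w patch grid.

    Patch centres are mapped into the fundamental cell [0,1) x [0,1),
    offset by half a patch so no coordinate hits the pole at 0; features
    are (Re p, Im p, Re p', Im p').
    """
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = (cols + 0.5) / w + 1j * (rows + 0.5) / h
    p, dp = weierstrass_p(z, N=N)
    return np.stack([p.real, p.imag, dp.real, dp.imag], axis=-1)
```

Because both ℘ and ℘' are emitted per patch, relative information between two patches could in principle be recovered from their absolute features via the addition formula above, which is the property the abstract highlights; the real paper's lattice parameters and feature projection may differ.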
Related papers
- RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - Dense Semantic Matching with VGGT Prior [49.42199006453071]
We propose an approach that retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences. Our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
arXiv Detail & Related papers (2025-09-25T14:56:11Z) - Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry [7.3623134099785155]
Vision Transformer (ViT) has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. This paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach. Experimental results confirm that the proposed method outperforms traditional ViT in terms of classification accuracy and data distribution.
arXiv Detail & Related papers (2025-08-23T16:39:09Z) - Geometric Operator Learning with Optimal Transport [77.16909146519227]
We propose integrating optimal transport (OT) into operator learning for partial differential equations (PDEs) on complex geometries. For 3D simulations focused on surfaces, our OT-based neural operator embeds the surface geometry into a 2D parameterized latent space. Experiments with Reynolds-averaged Navier-Stokes equations (RANS) on the ShapeNet-Car and DrivAerNet-Car datasets show that our method achieves better accuracy and also reduces computational expenses.
arXiv Detail & Related papers (2025-07-26T21:28:25Z) - Enforcing Latent Euclidean Geometry in Single-Cell VAEs for Manifold Interpolation [79.27003481818413]
We introduce FlatVI, a training framework that regularises the latent manifold of discrete-likelihood variational autoencoders towards Euclidean geometry. By encouraging straight lines in the latent space to approximate geodesics on the decoded single-cell manifold, FlatVI enhances compatibility with downstream approaches.
arXiv Detail & Related papers (2025-07-15T23:08:14Z) - Geometry-Editable and Appearance-Preserving Object Composition [67.98806888489385]
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion model that first leverages semantic embeddings to implicitly capture desired geometric transformations.
arXiv Detail & Related papers (2025-05-27T09:05:28Z) - GeloVec: Higher Dimensional Geometric Smoothing for Coherent Visual Feature Extraction in Image Segmentation [0.0]
GeloVec is a new CNN-based attention smoothing framework for semantic segmentation. It implements a higher-dimensional geometric smoothing method to establish robust manifold relationships between visually coherent regions. Our framework exhibits strong generalization capabilities across disciplines due to the absence of information loss during transformations.
arXiv Detail & Related papers (2025-05-02T07:07:00Z) - Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism for optimized vision tasks that inherently captures 2D spatial structures. GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z) - Shape-informed surrogate models based on signed distance function domain encoding [8.052704959617207]
We propose a non-intrusive method to build surrogate models that approximate the solution of parameterized partial differential equations (PDEs). Our approach is based on the combination of two neural networks (NNs).
arXiv Detail & Related papers (2024-09-19T01:47:04Z) - Thinner Latent Spaces: Detecting Dimension and Imposing Invariance with Conformal Autoencoders [8.743941823307967]
We show that orthogonality relations within the latent layer of the network can be leveraged to infer the intrinsic dimensionality of nonlinear manifold data sets. We outline the relevant theory relying on differential geometry, and describe the corresponding gradient-descent optimization algorithm.
arXiv Detail & Related papers (2024-08-28T20:56:35Z) - Neural Isometries: Taming Transformations for Equivariant ML [8.203292895010748]
We introduce Neural Isometries, an autoencoder framework which learns to map the observation space to a general-purpose latent space.
We show that a simple off-the-shelf equivariant network operating in the pre-trained latent space can achieve results on par with meticulously-engineered, handcrafted networks.
arXiv Detail & Related papers (2024-05-29T17:24:25Z) - Towards Geometric-Photometric Joint Alignment for Facial Mesh Registration [3.1932242398896964]
This paper presents a Geometric-Photometric Joint Alignment (GPJA) method. It aligns discrete human expressions at pixel-level accuracy by combining geometric and photometric information. This consistency benefits face animation, re-parametrization, and other batch operations for face modeling and applications with enhanced efficiency.
arXiv Detail & Related papers (2024-03-05T03:39:23Z) - Solving High-Dimensional PDEs with Latent Spectral Models [74.1011309005488]
We present Latent Spectral Models (LSM) toward an efficient and precise solver for high-dimensional PDEs.
Inspired by classical spectral methods in numerical analysis, we design a neural spectral block to solve PDEs in the latent space.
LSM achieves consistent state-of-the-art and yields a relative gain of 11.5% averaged on seven benchmarks.
arXiv Detail & Related papers (2023-01-30T04:58:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.