Related papers: Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

URL: http://arxiv.org/abs/2602.01418v1
Date: Sun, 01 Feb 2026 19:51:27 GMT
Title: Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
Authors: Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis
Abstract summary: We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets.
Score: 10.805953214146166
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but with only a partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

Related papers

HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score [14.857585045577165]
HIVTP is a training-free method to improve Vision-Language Models (VLMs) inference efficiency. We propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively.
arXiv Detail & Related papers (2025-09-28T05:53:39Z)
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models [49.122200327049676]
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. When extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens. We introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases.
arXiv Detail & Related papers (2025-05-22T09:05:01Z)
Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z)
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding [64.29499221878746]
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence. PyPE is a novel approach designed to enhance the perception of visual tokens within VLMs. Our method reduces the relative distance between interrelated visual elements and instruction tokens.
arXiv Detail & Related papers (2025-01-19T07:00:46Z)
EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment [40.328294121805456]
This work builds on the previous work VPD, which paved the way to use the Stable Diffusion network for computer vision tasks. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances feature learning capabilities. Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone.
arXiv Detail & Related papers (2023-12-13T22:20:45Z)
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers. By contrast, the discrete tokens in the NLP field are naturally highly semantic. We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z)
Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.