Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry
- URL: http://arxiv.org/abs/2508.17081v1
- Date: Sat, 23 Aug 2025 16:39:09 GMT
- Title: Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry
- Authors: Haoyu Yun, Hamid Krim
- Abstract summary: Vision Transformer (ViT) has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. This paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach. Experimental results confirm that the proposed method outperforms traditional ViT in terms of classification accuracy and data distribution.
- Score: 7.3623134099785155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains confined to modeling local relationships within individual images, limiting its ability to capture the global geometric relationships between data points. To address this limitation, this paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach to enhance feature representation and classification performance. In this framework, ViT constructs the tangent bundle of the manifold through its self-attention mechanism, where each attention head corresponds to a tangent space, offering geometric representations from diverse local perspectives. Proximal iterations are then introduced to define sections within the tangent bundle and project data from tangent spaces onto the base space, achieving global feature alignment and optimization. Experimental results confirm that the proposed method outperforms traditional ViT in terms of classification accuracy and data distribution.
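The proximal projection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the consensus objective, the step size `lam`, and the simple mean as a base-space estimate are all assumptions made for illustration only.

```python
import numpy as np

def proximal_align(head_feats, lam=0.5, n_iter=10):
    """Hypothetical sketch: align per-head (tangent-space) features toward a
    shared base-space point via proximal iterations.

    head_feats: (H, D) array, one feature vector per attention head.
    Returns the aligned (H, D) features.
    """
    x = head_feats.astype(float).copy()
    for _ in range(n_iter):
        base = x.mean(axis=0)  # crude base-space estimate (consensus point)
        # prox of f(v) = 0.5 * ||v - base||^2 has the closed form
        # prox_{lam f}(v) = (v + lam * base) / (1 + lam),
        # which contracts each head's feature toward the base point.
        x = (x + lam * base) / (1.0 + lam)
    return x
```

Under these assumptions, each iteration shrinks the spread across heads by a factor of 1/(1 + lam) while leaving their mean unchanged, so the per-head features converge to a common base-space representation.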
Related papers
- HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment [84.65251073657883]
We propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Third, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters.
arXiv Detail & Related papers (2026-01-08T05:41:06Z) - Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds [49.95082206008502]
Alignment across Trees is a method that constructs and aligns tree-like hierarchical features for both image and text modalities. We introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers.
arXiv Detail & Related papers (2025-10-31T11:32:15Z) - VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction [26.668204454537246]
We introduce Visual Gaussian Driving (VGD), a novel feed-forward end-to-end learning framework designed to address this challenge. We show that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings.
arXiv Detail & Related papers (2025-10-22T13:28:49Z) - SegMASt3R: Geometry Grounded Segment Matching [23.257530861472656]
We leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with viewpoint changes of up to 180 degrees.
arXiv Detail & Related papers (2025-10-06T17:31:32Z) - Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions [2.8199098530835127]
Vision Transformers have demonstrated remarkable success in computer vision tasks. Traditional positional encoding approaches fail to establish a monotonic correspondence between Euclidean spatial distances and sequential index distances. We propose WEF-PE, a mathematically principled approach that directly embeds two-dimensional coordinates through a natural complex-domain representation.
arXiv Detail & Related papers (2025-08-26T16:14:59Z) - Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance [61.41904916189093]
We propose a novel diffusion-based framework for reconstructing the 3D geometry of hand-held objects from monocular RGB images. We use geometric guidance derived from the hand to ensure plausible hand-object interactions.
arXiv Detail & Related papers (2025-08-25T17:11:53Z) - Decouple before Align: Visual Disentanglement Enhances Prompt Tuning [85.91474962071452]
Prompt tuning (PT) has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality typically conveys more context than the text. We propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept.
arXiv Detail & Related papers (2025-08-01T07:46:00Z) - Geometry-Editable and Appearance-Preserving Object Composition [67.98806888489385]
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion model that first leverages semantic embeddings to implicitly capture desired geometric transformations.
arXiv Detail & Related papers (2025-05-27T09:05:28Z) - FlexPara: Flexible Neural Surface Parameterization [71.65203972602673]
This paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations. We design and combine a series of geometrically interpretable sub-networks, each with a specific functionality, to construct a bi-directional cycle mapping framework for global parameterization. Experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm.
arXiv Detail & Related papers (2025-04-27T12:30:08Z) - Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data [14.104497777255137]
We introduce the Low-rank Efficient Spatial-Spectral (LESS) Vision Transformer with three key innovations. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models.
arXiv Detail & Related papers (2025-03-17T05:42:19Z) - HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space [1.1858475445768824]
This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry.
While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations.
We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization.
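For intuition, the hyperbolic operations the HVT summary mentions, Möbius addition and the induced geodesic distance on the Poincaré ball, can be sketched as below. This is the generic textbook formulation with unit negative curvature, not code from the paper:

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition of points x, y inside the unit Poincare ball (c = 1)."""
    xy = float(np.dot(x, y))
    x2 = float(np.dot(x, x))
    y2 = float(np.dot(y, y))
    num = (1.0 + 2.0 * xy + y2) * x + (1.0 - x2) * y
    den = 1.0 + 2.0 * xy + x2 * y2
    return num / den

def hyperbolic_dist(x, y):
    """Geodesic distance on the Poincare ball: d(x, y) = 2 * artanh(||(-x) (+) y||)."""
    return 2.0 * np.arctanh(np.linalg.norm(mobius_add(-x, y)))
```

A hyperbolic attention layer in this spirit would replace Euclidean dot-product similarity with scores derived from `-hyperbolic_dist(q, k)`; the exact scoring rule used by HVT is not specified in the summary above.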
arXiv Detail & Related papers (2024-09-25T13:07:37Z) - Str-L Pose: Integrating Point and Structured Line for Relative Pose Estimation in Dual-Graph [45.115555973941255]
Relative pose estimation is crucial for various computer vision applications, including robotics and autonomous driving.
We propose a Geometric Correspondence Graph neural network that integrates point features with extra structured line segments.
This integration of matched points and line segments further exploits the geometry constraints and enhances model performance across different environments.
arXiv Detail & Related papers (2024-08-28T12:33:26Z) - Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well-trained on massive image collections.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z) - Surface Vision Transformers: Attention-Based Modelling applied to Cortical Analysis [8.20832544370228]
We introduce a domain-agnostic architecture to study any surface data projected onto a spherical manifold.
A vision transformer model encodes the sequence of patches via successive multi-head self-attention layers.
Experiments show that the Surface Vision Transformer (SiT) generally outperforms surface CNNs, while performing comparably on registered and unregistered data.
arXiv Detail & Related papers (2022-03-30T15:56:11Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve an object's inner consistency by modeling the global context, or refine object details along boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the low- and high-frequency components of the image, respectively.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.