Hilbert Flattening: a Locality-Preserving Matrix Unfolding Method for
Visual Discrimination
- URL: http://arxiv.org/abs/2202.10240v7
- Date: Tue, 30 Jan 2024 06:56:43 GMT
- Title: Hilbert Flattening: a Locality-Preserving Matrix Unfolding Method for
Visual Discrimination
- Authors: Qingsong Zhao, Yi Wang, Zhipeng Zhou, Duoqian Miao, Limin Wang, Yu
Qiao, Cairong Zhao
- Abstract summary: We propose Hilbert curve flattening as an innovative method to preserve locality in flattened matrices.
We also introduce the Localformer, a vision transformer architecture that incorporates Hilbert token sampling with a token aggregator to enhance its locality bias.
- Score: 51.432453379052724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Flattening, which converts multi-dimensional feature maps or images
into one-dimensional vectors, is essential in computer vision. However, existing
flattening approaches neglect the preservation of local smoothness, which can
limit the representation learning capacity of vision models. In this paper, we
propose Hilbert curve flattening as an innovative method to preserve locality
in flattened matrices. We compare it with the commonly used Zigzag operation
and demonstrate that Hilbert curve flattening better retains the spatial
relationships and local smoothness of the original grid structure while
remaining robust to input scale variance. In addition, we introduce the
Localformer, a vision transformer architecture that incorporates Hilbert token
sampling with a token aggregator to enhance its locality bias.
Extensive experiments on image classification and semantic segmentation tasks
demonstrate that the Localformer outperforms baseline models consistently. We
also show it brings consistent performance boosts for other popular
architectures (e.g. MLP-Mixer).
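For intuition, below is a minimal sketch (not from the paper) of flattening a feature map along a Hilbert curve, using the standard Hilbert index-to-coordinate conversion. The function names (d2xy, hilbert_flatten, mean_step) are illustrative, and the grid side length is assumed to be a power of two. The locality check at the end illustrates the abstract's claim: consecutive tokens in the Hilbert order are always grid neighbours, whereas a row-major scan takes a long jump at every row boundary.

    import numpy as np

    def d2xy(n, d):
        # Map Hilbert-curve index d to (x, y) on an n x n grid (n must be a power of 2).
        x = y = 0
        s, t = 1, d
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                      # rotate the quadrant when required
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y

    def hilbert_flatten(feat):
        # Flatten an n x n (optionally n x n x c) feature map along the Hilbert curve.
        n = feat.shape[0]
        order = [d2xy(n, d) for d in range(n * n)]
        return np.stack([feat[y, x] for x, y in order])

    def mean_step(order):
        # Average Manhattan distance between cells that become adjacent in the 1D sequence.
        return float(np.mean([abs(x1 - x0) + abs(y1 - y0)
                              for (x0, y0), (x1, y1) in zip(order, order[1:])]))

    n = 8
    feat = np.arange(n * n).reshape(n, n)
    tokens = hilbert_flatten(feat)                       # 64 tokens in Hilbert order
    hilbert_order = [d2xy(n, d) for d in range(n * n)]
    row_major_order = [(x, y) for y in range(n) for x in range(n)]
    print(mean_step(hilbert_order))    # 1.0: every step moves to a 4-neighbour cell
    print(mean_step(row_major_order))  # ~1.78: the scan jumps across the grid at each row end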
Related papers
- MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models [25.406556604989607]
Hallucinations pose a significant challenge in Large Vision-Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling.
arXiv Detail & Related papers (2025-07-12T08:09:35Z) - REOrdering Patches Improves Vision Models [50.24865821590156]
We show that patch order significantly affects model performance in such settings. We propose REOrder, a framework for discovering task-optimal patch orderings. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and on Functional Map of the World by 13.35%.
arXiv Detail & Related papers (2025-05-29T17:59:30Z) - Vector Field Attention for Deformable Image Registration [9.852055065890479]
Deformable image registration establishes non-linear spatial correspondences between fixed and moving images.
Most existing deep learning-based methods require neural networks to encode location information in their feature maps.
We present Vector Field Attention (VFA), a novel framework that enhances the efficiency of existing network designs by enabling direct retrieval of location correspondences.
arXiv Detail & Related papers (2024-07-14T14:06:58Z) - Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP processes co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z) - Towards Better Gradient Consistency for Neural Signed Distance Functions
via Level Set Alignment [50.892158511845466]
We show that gradient consistency in the field, indicated by the parallelism of level sets, is the key factor affecting the inference accuracy.
We propose a level set alignment loss to evaluate the parallelism of level sets, which can be minimized to achieve better gradient consistency.
arXiv Detail & Related papers (2023-05-19T11:28:05Z) - A Geometrically Constrained Point Matching based on View-invariant
Cross-ratios, and Homography [2.050924050557755]
A geometrically constrained algorithm is proposed to verify the correctness of initially matched SIFT keypoints based on view-invariant cross-ratios (CRs).
By randomly forming pentagons from these keypoints and matching their shape and location among images with CRs, robust planar region estimation can be achieved efficiently.
Experiments show that satisfactory results can be obtained for various scenes with both single and multiple planar regions.
arXiv Detail & Related papers (2022-11-06T01:55:35Z) - Neural Space-filling Curves [47.852964985588486]
We present a data-driven approach to infer a context-based scan order for a set of images.
Our work learns a spatially coherent linear ordering of pixels from the dataset of images using a graph-based neural network.
We show the advantage of using Neural SFCs in downstream applications such as image compression.
arXiv Detail & Related papers (2022-04-18T17:59:01Z) - UltraSR: Spatial Encoding is a Missing Key for Implicit Image
Function-based Arbitrary-Scale Super-Resolution [74.82282301089994]
In this work, we propose UltraSR, a simple yet effective new network design based on implicit image functions.
We show that spatial encoding is indeed a missing key towards the next-stage high-accuracy implicit image function.
Our UltraSR sets new state-of-the-art performance on the DIV2K benchmark under all super-resolution scales.
arXiv Detail & Related papers (2021-03-23T17:36:42Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.