REOrdering Patches Improves Vision Models
- URL: http://arxiv.org/abs/2505.23751v1
- Date: Thu, 29 May 2025 17:59:30 GMT
- Title: REOrdering Patches Improves Vision Models
- Authors: Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta
- Abstract summary: We show that patch order significantly affects model performance in such settings. We propose REOrder, a framework for discovering task-optimal patch orderings. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
- Score: 50.24865821590156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
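The two stages described in the abstract can be pictured with a small sketch. The code below is a hypothetical illustration based only on the abstract, not the authors' implementation: it scores an ordering by how well it compresses (the information-theoretic prior) and learns per-patch scores for a Plackett-Luce policy with a REINFORCE update. The names `compression_prior`, `reward_fn`, and the parameterization are assumptions for illustration.

```python
import zlib
import torch

def compression_prior(patches: torch.Tensor, perm: torch.Tensor) -> int:
    """Stage 1 (heuristic): zlib-compressed length of the patches laid out in
    the given order; a shorter result suggests a more redundant, easier
    sequence. `patches` is assumed to be a uint8 tensor of shape [N, ...]."""
    flat = patches[perm].contiguous().cpu().numpy().tobytes()
    return len(zlib.compress(flat))

class PlackettLucePolicy(torch.nn.Module):
    """Stage 2: one learnable score per patch slot defines a Plackett-Luce
    distribution over permutations (higher score = earlier in the order)."""
    def __init__(self, num_patches: int):
        super().__init__()
        self.scores = torch.nn.Parameter(torch.zeros(num_patches))

    def sample(self) -> torch.Tensor:
        # Gumbel-perturbed argsort draws a permutation from the
        # Plackett-Luce distribution defined by self.scores.
        gumbel = -torch.log(-torch.log(torch.rand_like(self.scores)))
        return torch.argsort(self.scores + gumbel, descending=True)

    def log_prob(self, perm: torch.Tensor) -> torch.Tensor:
        # log P(perm) = sum_i [ s_{perm_i} - logsumexp(s_{perm_i}, ..., s_{perm_N}) ]
        s = self.scores[perm]
        suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
        return (s - suffix_lse).sum()

def reinforce_step(policy, optimizer, reward_fn, baseline=0.0):
    """One REINFORCE update; `reward_fn` is a placeholder for the task reward
    (e.g., batch accuracy) from running the vision model on the reordered
    patch sequence."""
    perm = policy.sample()
    reward = reward_fn(perm)
    loss = -(reward - baseline) * policy.log_prob(perm)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return perm, reward
```

In the paper the policy is presumably trained against a long-sequence vision transformer on the target task; the sketch above only captures the permutation sampling and gradient-estimation mechanics.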
Related papers
- Autoregressive Image Generation with Randomized Parallel Decoding [23.714192351237628]
ARPG is a novel visual autoregressive model that enables randomized parallel generation. Our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput.
arXiv Detail & Related papers (2025-03-13T17:19:51Z)
- Texture, Shape and Order Matter: A New Transformer Design for Sequential DeepFake Detection [57.100891917805086]
Sequential DeepFake detection is an emerging task that predicts the manipulation sequence in order. This paper describes a new Transformer design, called TSOM, by exploring three perspectives: Texture, Shape, and Order of Manipulations.
arXiv Detail & Related papers (2024-04-22T04:47:52Z)
- A Strong Baseline for Point Cloud Registration via Direct Superpoints Matching [7.308509114539376]
We propose a simple and effective baseline to find correspondences of superpoints in a global matching manner.
Our simple yet effective baseline achieves results comparable to or even better than state-of-the-art methods on three datasets.
arXiv Detail & Related papers (2023-07-03T21:33:40Z)
- Ray-Patch: An Efficient Querying for Light Field Transformers [10.859910783551937]
We propose the Ray-Patch querying, a novel model to efficiently query transformers to decode implicit representations into target views.
Our Ray-Patch decoding reduces the computational footprint and increases inference speed by up to an order of magnitude compared to previous models.
arXiv Detail & Related papers (2023-05-16T16:03:27Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- ExpansionNet: exploring the sequence length bottleneck in the Transformer for Image Captioning [0.0]
We propose a new method called the "Expansion Mechanism", which transforms the input sequence, either dynamically or statically, into a new sequence of a different length.
We exploit this method to achieve competitive performance on the MS-COCO 2014 dataset.
arXiv Detail & Related papers (2022-07-07T14:37:02Z)
- HIPA: Hierarchical Patch Transformer for Single Image Super Resolution [62.7081074931892]
This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition.
We build a cascaded model that processes an input image in multiple stages, starting with tokens of small patch size and gradually merging them up to the full resolution.
Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions.
arXiv Detail & Related papers (2022-03-19T05:09:34Z)
- Short Range Correlation Transformer for Occluded Person Re-Identification [4.339510167603376]
We propose a partial feature transformer-based person re-identification framework named PFT.
The proposed PFT utilizes three modules to enhance the efficiency of the vision transformer.
Experimental results over occluded and holistic re-identification datasets demonstrate that the proposed PFT network achieves superior performance consistently.
arXiv Detail & Related papers (2022-01-04T11:12:39Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
arXiv Detail & Related papers (2021-06-25T05:17:55Z)
- Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks and, for each query, compute the top-$k$ scores with respect to the keys (a rough sketch of this idea appears after the list).
We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
arXiv Detail & Related papers (2021-06-13T02:30:23Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
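As a side note on the top-$k$ attention entry above, here is a rough sketch of what chunked top-$k$ attention can look like. It is my reading of the abstract, not that paper's reference code; the function name, shapes, and the chunk and top-$k$ sizes are illustrative assumptions, and it shows a single head without masking.

```python
import torch

def topk_attention(q, k, v, top_k=64, chunk=1024):
    # q: [Lq, d], k: [Lk, d], v: [Lk, dv]
    outputs = []
    scale = q.shape[-1] ** -0.5
    for start in range(0, q.shape[0], chunk):
        q_chunk = q[start:start + chunk]                    # [c, d]
        scores = (q_chunk @ k.T) * scale                    # [c, Lk]
        # keep only the k largest query-key scores per query
        vals, idx = scores.topk(min(top_k, scores.shape[-1]), dim=-1)
        probs = vals.softmax(dim=-1)                        # softmax over retained scores only
        v_sel = v[idx]                                      # [c, k, dv] matching value vectors
        outputs.append(torch.einsum('ck,ckd->cd', probs, v_sel))
    return torch.cat(outputs, dim=0)
```

Because each query attends to at most `top_k` keys, the memory for the attention weights scales with the chunk size and `top_k` rather than with the full key length.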