LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
- URL: http://arxiv.org/abs/2505.18051v1
- Date: Fri, 23 May 2025 15:56:35 GMT
- Title: LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
- Authors: Anthony Fuller, Yousef Yassin, Junfeng Wen, Daniel G. Kyrollos, Tarek Ibrahim, James R. Green, Evan Shelhamer
- Abstract summary: Vision transformers are ever larger, more accurate, and more expensive to compute. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input.
- Score: 10.461453853510964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.
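As a rough illustration of the split described in the abstract, the PyTorch sketch below pairs a cheap low-resolution selector that scores high-resolution patch locations with an extractor that embeds and encodes only the top-k selected patches. All module names, sizes, and the top-k selection rule are assumptions for illustration, not the authors' implementation (which is also pretrained by distillation from a self-supervised teacher, omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Selector(nn.Module):
    """Cheap module on a low-resolution view: predicts a saliency score per
    high-resolution patch location (assumed design, not the paper's exact one)."""
    def __init__(self, patch=16, dim=64, grid=28):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.score = nn.Conv2d(dim, 1, kernel_size=1)
        self.grid = grid  # high-res patch grid to score (e.g. 28x28 for a 448px input)

    def forward(self, x_low):
        f = self.embed(x_low)                      # (B, dim, h, w) on the low-res input
        s = self.score(f)                          # (B, 1, h, w)
        s = F.interpolate(s, size=(self.grid, self.grid), mode="bilinear",
                          align_corners=False)
        return s.flatten(1)                        # (B, grid*grid) saliency per patch

class Extractor(nn.Module):
    """ViT-style encoder that only embeds and processes the selected patches."""
    def __init__(self, patch=16, dim=192, depth=4, heads=3):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x_high, idx):
        # unfold into non-overlapping patches: (B, N, 3*patch*patch)
        patches = F.unfold(x_high, self.patch, stride=self.patch).transpose(1, 2)
        sel = torch.gather(patches, 1,
                           idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return self.encoder(self.proj(sel))        # (B, k, dim) tokens

def look_where(x_high, x_low, selector, extractor, k=64):
    scores = selector(x_low)                       # where to look
    idx = scores.topk(k, dim=1).indices            # keep the k most salient patches
    return extractor(x_high, idx)                  # what to see

selector, extractor = Selector(), Extractor()
x_high = torch.randn(2, 3, 448, 448)
x_low = F.interpolate(x_high, size=224)
tokens = look_where(x_high, x_low, selector, extractor, k=64)
print(tokens.shape)  # torch.Size([2, 64, 192])
```

The point of the split is that the quadratically expensive transformer only ever sees k tokens rather than the full high-resolution patch grid.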
Related papers
- AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer [13.945118817568366]
We introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate inference. By representing the anchors with the neurons in a neural layer, we can differentiably learn these anchors and approximate global self-attention. Experiments show the effectiveness of our AnchorFormer, achieving up to 9.0% higher accuracy or a 46.7% FLOPs reduction on ImageNet classification.
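A generic sketch of anchor-mediated attention in the spirit of this summary: m learnable anchors gather information from all N tokens and the tokens read it back, so mixing costs O(N·m) instead of O(N²). This is an illustrative stand-in, not AnchorFormer's exact formulation.

```python
import torch
import torch.nn as nn

class AnchorAttention(nn.Module):
    """N tokens exchange information through m << N learnable anchors."""
    def __init__(self, dim=192, num_anchors=16):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim))  # learnable anchor "neurons"
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5

    def forward(self, x):                     # x: (B, N, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        a = self.anchors.unsqueeze(0).expand(x.size(0), -1, -1)     # (B, m, dim)
        # anchors gather information from all tokens
        gather = torch.softmax(a @ k.transpose(1, 2) * self.scale, dim=-1) @ v      # (B, m, dim)
        # tokens read the pooled information back from the anchors
        return torch.softmax(q @ a.transpose(1, 2) * self.scale, dim=-1) @ gather   # (B, N, dim)

attn = AnchorAttention()
x = torch.randn(2, 196, 192)
print(attn(x).shape)   # torch.Size([2, 196, 192])
```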
arXiv Detail & Related papers (2025-05-22T09:44:44Z) - High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution [87.56382172827526]
High-frequency regions are most critical for reconstruction. We propose a training-free adaptive masking module for acceleration. Our method reduces FLOPs by 24-43% for state-of-the-art models.
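A minimal, training-free sketch of the idea: score each patch by its high-frequency energy (here a simple Laplacian response, an assumption rather than the paper's exact prior) and keep only the most detailed patches for the expensive super-resolution branch.

```python
import torch
import torch.nn.functional as F

def high_frequency_mask(lr_image, patch=8, keep_ratio=0.5):
    """Return a boolean patch mask marking the highest-frequency regions."""
    gray = lr_image.mean(dim=1, keepdim=True)                     # (B, 1, H, W)
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=lr_image.device).view(1, 1, 3, 3)
    hf = F.conv2d(gray, lap, padding=1).abs()                     # high-frequency response
    energy = F.avg_pool2d(hf, patch)                              # per-patch energy (B, 1, h, w)
    B, _, h, w = energy.shape
    flat = energy.flatten(1)                                      # (B, h*w)
    k = max(1, int(keep_ratio * flat.size(1)))
    thresh = flat.topk(k, dim=1).values[:, -1:]                   # per-image threshold
    return (flat >= thresh).view(B, h, w)

x = torch.randn(2, 3, 64, 64)
mask = high_frequency_mask(x, patch=8, keep_ratio=0.4)
print(mask.shape, mask.float().mean().item())  # (2, 8, 8), roughly 40% of patches kept
```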
arXiv Detail & Related papers (2025-05-11T13:18:03Z) - When Less is Enough: Adaptive Token Reduction for Efficient Image Representation [2.2120851074630177]
We introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism. Our results highlight a promising direction towards adaptive and efficient multimodal pruning.
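A small sketch of the selection mechanism as summarized: per-token keep/drop logits sampled with the straight-through Gumbel-Softmax, and a light decoder that checks whether the dropped tokens can be reconstructed from the kept ones. Sizes and the reconstruction loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelTokenSelector(nn.Module):
    """Differentiable keep/drop token selection trained by reconstruction."""
    def __init__(self, dim=192, tau=1.0):
        super().__init__()
        self.score = nn.Linear(dim, 2)        # logits for (drop, keep)
        layer = nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tau = tau

    def forward(self, tokens):                # tokens: (B, N, dim)
        logits = self.score(tokens)
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]  # (B, N, 1)
        kept = tokens * gate                  # dropped tokens are zeroed, gradients still flow
        recon = self.decoder(kept)
        loss = F.mse_loss(recon, tokens)      # can everything be rebuilt from what we kept?
        return kept, gate, loss

model = GumbelTokenSelector()
tokens = torch.randn(2, 196, 192)
kept, gate, loss = model(tokens)
print(gate.sum().item(), loss.item())         # number of kept tokens, reconstruction error
```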
arXiv Detail & Related papers (2025-03-20T19:17:08Z) - TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration [2.177039289023855]
Active Visual Exploration (AVE) optimizes the utilization of robotic resources in real-world scenarios by sequentially selecting the most informative observations.
We introduce a novel approach to AVE called TOken REcycling (TORE).
It divides the encoder into extractor and aggregator components: the extractor processes each observation separately, enabling the reuse of tokens passed to the aggregator.
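An illustrative, inference-time sketch of the extractor/aggregator split: each glimpse is encoded once, its tokens are cached, and only the aggregator re-runs over the growing token cache. All sizes and modules are assumptions, not the TORE implementation.

```python
import torch
import torch.nn as nn

class TokenRecyclingAVE(nn.Module):
    """Extractor encodes each glimpse once; aggregator reuses the cached tokens."""
    def __init__(self, dim=192):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Linear(3 * 32 * 32, dim),      # assume flattened 32x32 RGB glimpse patches
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True), 2),
        )
        self.aggregate = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True), 2)
        self.head = nn.Linear(dim, 10)        # e.g. 10-way classification
        self.cache = []                       # recycled tokens from earlier glimpses

    def observe(self, glimpse_patches):       # (B, P, 3*32*32): one new observation
        with torch.no_grad():                 # inference sketch: old tokens are never recomputed
            self.cache.append(self.extract(glimpse_patches))
        tokens = torch.cat(self.cache, dim=1) # all tokens seen so far
        return self.head(self.aggregate(tokens).mean(dim=1))

model = TokenRecyclingAVE()
for step in range(3):                         # three sequential observations
    logits = model.observe(torch.randn(2, 16, 3 * 32 * 32))
print(logits.shape)                           # torch.Size([2, 10])
```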
arXiv Detail & Related papers (2023-11-26T15:39:57Z) - Ideal Abstractions for Decision-Focused Learning [108.15241246054515]
We propose a method that configures the output space automatically in order to minimize the loss of decision-relevant information.
We demonstrate the method in two domains: data acquisition for deep neural network training and a closed-loop wildfire management task.
arXiv Detail & Related papers (2023-03-29T23:31:32Z) - Token Pooling in Vision Transformers [37.11990688046186]
In vision transformers, self-attention is not the major bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over state-of-the-art downsampling methods.
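One simple instantiation of clustering-based token downsampling, sketched below with plain k-means over token vectors; the paper's pooling objective differs in detail.

```python
import torch
import torch.nn.functional as F

def token_pooling_kmeans(tokens, k=49, iters=10):
    """Pool N tokens per image down to k cluster means (a k-means stand-in)."""
    B, N, D = tokens.shape
    centers = tokens[:, torch.randperm(N)[:k]]                       # (B, k, D) random init
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=-1)         # nearest center per token
        onehot = F.one_hot(assign, k).float()                        # (B, N, k)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)        # (B, k, 1)
        centers = onehot.transpose(1, 2) @ tokens / counts           # recompute cluster means
    return centers                                                   # (B, k, D) pooled tokens

x = torch.randn(2, 196, 192)
pooled = token_pooling_kmeans(x, k=49)
print(pooled.shape)   # torch.Size([2, 49, 192])
```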
arXiv Detail & Related papers (2021-10-08T02:22:50Z) - Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation [65.83008812026635]
We construct Guided Cost volume Excitation (GCE) and show that simple channel excitation of the cost volume, guided by the image, can improve performance considerably.
We present an end-to-end network that we call Correlate-and-Excite (CoEx).
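A compact sketch of what guided channel excitation of a cost volume can look like: weights predicted from the left-image features gate the cost-volume channels at each pixel, broadcast over the disparity dimension. Shapes and the gating network are assumptions.

```python
import torch
import torch.nn as nn

class GuidedCostExcitation(nn.Module):
    """Image-guided, per-pixel channel gating of a stereo cost volume."""
    def __init__(self, feat_ch=32, cost_ch=8):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(feat_ch, cost_ch, 1), nn.Sigmoid())

    def forward(self, cost, image_feat):
        # cost: (B, C, D, H, W) matching cost volume; image_feat: (B, F, H, W)
        w = self.gate(image_feat).unsqueeze(2)     # (B, C, 1, H, W)
        return cost * w                            # excite the informative channels

gce = GuidedCostExcitation()
cost = torch.randn(1, 8, 48, 64, 128)              # 48 disparity hypotheses
feat = torch.randn(1, 32, 64, 128)
print(gce(cost, feat).shape)                        # torch.Size([1, 8, 48, 64, 128])
```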
arXiv Detail & Related papers (2021-08-12T14:32:26Z) - Sample and Computation Redistribution for Efficient Face Detection [137.19388513633484]
Training data sampling and computation distribution strategies are the keys to efficient and accurate face detection.
SCRFD-34GF outperforms the best competitor, TinaFace, by 3.86% (AP on the hard set) while being more than 3x faster on GPUs with VGA-resolution images.
arXiv Detail & Related papers (2021-05-10T23:51:14Z) - Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
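A sketch of the displacement-invariant idea: the same 2D matching network is applied per disparity hypothesis to (left, shifted-right) feature pairs, so no 4D feature volume is ever materialized. The shift handling and network sizes are illustrative.

```python
import torch
import torch.nn as nn

class DisplacementInvariantCost(nn.Module):
    """One shared 2D network computes the cost slice for every displacement."""
    def __init__(self, feat_ch=32):
        super().__init__()
        self.match = nn.Sequential(                 # shared across all displacements
            nn.Conv2d(2 * feat_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, feat_l, feat_r, max_disp=48):
        costs = []
        for d in range(max_disp):
            shifted = torch.roll(feat_r, shifts=d, dims=-1)   # crude shift for illustration
            if d > 0:
                shifted[..., :d] = 0                          # zero the wrapped-around columns
            costs.append(self.match(torch.cat([feat_l, shifted], dim=1)))
        return torch.cat(costs, dim=1)              # (B, max_disp, H, W) cost per pixel

net = DisplacementInvariantCost()
fl, fr = torch.randn(1, 32, 64, 128), torch.randn(1, 32, 64, 128)
print(net(fl, fr).shape)   # torch.Size([1, 48, 64, 128])
```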
arXiv Detail & Related papers (2020-12-01T23:58:16Z) - Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [57.33699905852397]
We propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed.
Our method simultaneously clusters the data while enforcing consistency between cluster assignments.
Our method can be trained with large and small batches and can scale to unlimited amounts of data.
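A compact sketch of the swapped-prediction objective with Sinkhorn-Knopp code computation, under assumed shapes (64 samples per batch, 128-d features, 300 prototypes); the full method adds a multi-crop strategy and a feature queue, omitted here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Turn prototype scores into soft assignments balanced across the batch."""
    q = torch.exp(scores / eps).t()              # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(iters):
        q /= q.sum(dim=1, keepdim=True); q /= K  # normalize rows (prototypes)
        q /= q.sum(dim=0, keepdim=True); q /= B  # normalize columns (samples)
    return (q * B).t()                           # (B, K) codes, no gradient

def swav_loss(z1, z2, prototypes, temp=0.1):
    """Swapped prediction: predict one view's code from the other view's scores."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    c = F.normalize(prototypes, dim=1)           # (K, D) prototype vectors
    s1, s2 = z1 @ c.t(), z2 @ c.t()              # (B, K) scores per view
    q1, q2 = sinkhorn(s1), sinkhorn(s2)          # cluster codes per view
    p1, p2 = F.log_softmax(s1 / temp, dim=1), F.log_softmax(s2 / temp, dim=1)
    return -0.5 * ((q1 * p2).sum(dim=1) + (q2 * p1).sum(dim=1)).mean()

prototypes = torch.randn(300, 128, requires_grad=True)   # 300 prototypes, 128-d features
z1, z2 = torch.randn(64, 128), torch.randn(64, 128)      # embeddings of two augmented views
print(swav_loss(z1, z2, prototypes).item())
```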
arXiv Detail & Related papers (2020-06-17T14:00:42Z)