Vision encoders should be image size agnostic and task driven
- URL: http://arxiv.org/abs/2508.16317v1
- Date: Fri, 22 Aug 2025 11:57:49 GMT
- Title: Vision encoders should be image size agnostic and task driven
- Authors: Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool
- Abstract summary: We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. It is our belief that vision encoders should be dynamic and the computational complexity should depend on the task at hand rather than the size of the image.
- Score: 60.09702846704075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait -- efficiency. We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. We -- humans and animals -- deal with vast quantities of visual data, and need to be smart about where we focus our limited energy -- it depends on the task. It is our belief that vision encoders should be dynamic and the computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision -- a proof-of-concept solution for image classification. Although classification is not very representative of what we are trying to achieve, it shows that our approach is feasible and promising.
Related papers
- In Pursuit of Pixel Supervision for Visual Pre-training [60.63095313440605]
"Pixio" is an enhanced masked autoencoder (MAE) trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation, feed-forward 3D reconstruction, semantic segmentation, and robot learning. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
arXiv Detail & Related papers (2025-12-17T18:59:58Z) - Does DINOv3 Set a New Medical Vision Standard? [67.33543059306938]
This report investigates whether DINOv3 can serve as a powerful unified encoder for medical vision tasks without domain-specific pre-training. We benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks.
arXiv Detail & Related papers (2025-09-08T09:28:57Z) - Traces of Image Memorability in Vision Encoders: Activations, Attention Distributions and Autoencoder Losses [5.369009163979958]
This paper explores the correlates of image memorability in pretrained vision encoders. We find that these features correlate with memorability to some extent. Results shed light on the relationship between model-internal features and memorability.
arXiv Detail & Related papers (2025-09-01T13:11:59Z) - Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes [8.941800684473202]
We introduce the Co-VisiON benchmark, designed to evaluate human-inspired co-visibility reasoning across more than 1,000 sparse-view indoor scenarios. Our results show that while co-visibility is often approached as a low-level feature-matching task, it remains challenging for existing vision models under sparse conditions. We propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM.
arXiv Detail & Related papers (2025-06-20T07:42:26Z) - When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - A computational approach to visual ecology with deep reinforcement learning [6.635611625764804]
This paper lays the foundation for a computational approach to visual ecology.
It shows how representations and behaviour emerge from an agent's drive for survival.
arXiv Detail & Related papers (2024-02-07T21:23:47Z) - Energy-Efficient Visual Search by Eye Movement and Low-Latency Spiking Neural Network [8.380017457339756]
Human vision incorporates a non-uniform-resolution retina, an efficient eye movement strategy, and a spiking neural network (SNN) to balance the requirements of visual field size, visual resolution, energy cost, and inference latency.
Here, we examine human visual search behaviors and establish the first SNN-based visual search model.
The model can learn either a human-like or a near-optimal fixation strategy, outperform humans in search speed and accuracy, and achieve high energy efficiency through short saccade decision latency and sparse activation.
arXiv Detail & Related papers (2023-10-10T12:39:10Z) - InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions.
InstructDiffusion can handle a variety of vision tasks, including understanding tasks and generative tasks.
It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z) - Visualizing and Understanding Contrastive Learning [22.553990823550784]
We design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images.
We also adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations.
arXiv Detail & Related papers (2022-06-20T13:01:46Z) - Visual Attention Network [90.0753726786985]
We propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention.
We also introduce a novel neural network based on LKA, namely Visual Attention Network (VAN).
VAN outperforms the state-of-the-art vision transformers and convolutional neural networks by a large margin in extensive experiments.
arXiv Detail & Related papers (2022-02-20T06:35:18Z) - Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner.
We show that our approach performs competitively with fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z) - Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach [104.02201472370801]
We come up with a novel image coding framework by leveraging both the compressive and the generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
arXiv Detail & Related papers (2020-01-09T10:37:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.