Image Reconstruction as a Tool for Feature Analysis
- URL: http://arxiv.org/abs/2506.07803v1
- Date: Mon, 09 Jun 2025 14:32:18 GMT
- Title: Image Reconstruction as a Tool for Feature Analysis
- Authors: Eduard Allakhverdov, Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov
- Abstract summary: We propose a novel approach for interpreting vision features via image reconstruction. We show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space.
- Score: 2.0249250133493195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available on GitHub.
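As a rough illustration of the reconstruction-based analysis described above, the following sketch trains a lightweight decoder on frozen encoder features and uses reconstruction quality as a proxy for how much image information the features retain. The decoder architecture, patch grid, and loss are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: probe a frozen vision encoder by learning to reconstruct
# its inputs from its patch features (assumed shape [B, N, D]).
import torch
import torch.nn as nn

class PatchDecoder(nn.Module):
    """Hypothetical lightweight decoder from patch features to pixels."""
    def __init__(self, feat_dim: int, patch: int = 16, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(feat_dim, 3 * patch * patch)
        self.to_image = nn.PixelShuffle(patch)  # [B, 3*p*p, g, g] -> [B, 3, g*p, g*p]

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, n, _ = feats.shape                     # n = grid * grid patch tokens
        x = self.proj(feats).transpose(1, 2)      # [B, 3*p*p, N]
        x = x.reshape(b, -1, self.grid, self.grid)
        return self.to_image(x)

def reconstruction_step(encoder, decoder, images, optimizer):
    """One training step: only the decoder is updated, the encoder stays frozen."""
    with torch.no_grad():
        feats = encoder(images)                   # assumed to return [B, N, D]
    recon = decoder(feats)
    loss = nn.functional.mse_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Encoders whose features support lower reconstruction error under this kind of probe would, by the paper's argument, retain more image information.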
Related papers
- Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning.
We find that contrastive vision-language training alone can produce strong, general embeddings for a range of downstream tasks.
Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z)
- Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models [27.806966289284528]
We present a unified framework using sparse autoencoders (SAEs) to discover human-interpretable visual features.
We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training.
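A minimal sketch of this kind of sparse autoencoder over vision-encoder activations is given below; the dictionary size, sparsity penalty, and choice of layer are assumptions for illustration, not the paper's settings.

```python
# Hedged sketch: a ReLU sparse autoencoder trained to reconstruct activations
# from one layer of a frozen vision encoder.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int = 8192, l1_coeff: float = 1e-3):
        super().__init__()
        self.encode = nn.Linear(d_model, d_dict)
        self.decode = nn.Linear(d_dict, d_model, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        # acts: [B, d_model] activations gathered from the chosen layer
        codes = torch.relu(self.encode(acts))     # sparse, non-negative codes
        recon = self.decode(codes)
        mse = nn.functional.mse_loss(recon, acts)
        loss = mse + self.l1_coeff * codes.abs().mean()
        return recon, codes, loss
```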
arXiv Detail & Related papers (2025-02-10T18:32:41Z)
- NARAIM: Native Aspect Ratio Autoregressive Image Models [26.26674614731835]
We propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio.
By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information.
arXiv Detail & Related papers (2024-10-13T21:13:48Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Neural architecture impact on identifying temporally extended Reinforcement Learning tasks [0.0]
We present attention-based architectures in the reinforcement learning (RL) domain, capable of performing well on the OpenAI Gym Atari-2600 game suite.
In attention-based models, extracting the attention map and overlaying it onto the input images allows direct observation of the information the agent uses to select actions.
In addition, motivated by recent developments in attention-based video-classification models using the Vision Transformer, we also develop a Vision Transformer-based architecture for the image-based RL domain.
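A rough sketch of the attention-overlay visualization mentioned above is shown below; the square patch-grid assumption and the blending scheme are illustrative, not the paper's exact procedure.

```python
# Hedged sketch: upsample a per-patch attention vector and blend it onto the frame.
import numpy as np
import torch
import torch.nn.functional as F

def overlay_attention(frame: np.ndarray, attn: torch.Tensor, alpha: float = 0.5) -> np.ndarray:
    """frame: HxWx3 uint8 image; attn: [num_patches] attention weights (e.g. the CLS row)."""
    h, w = frame.shape[:2]
    grid = int(attn.numel() ** 0.5)               # assume a square patch grid
    heat = attn.detach().float().reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(h, w), mode="bilinear", align_corners=False)
    heat = heat.squeeze().cpu().numpy()
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # blend the normalized heat map into the red channel for a quick visualization
    out = frame.astype(np.float32).copy()
    out[..., 0] = (1 - alpha) * out[..., 0] + alpha * 255.0 * heat
    return out.clip(0, 255).astype(np.uint8)
```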
arXiv Detail & Related papers (2023-10-04T21:09:19Z)
- GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task [47.1857510710807]
We present a new learning framework, dubbed GPT4Image, in which knowledge from large pre-trained models is extracted to help CNNs and ViTs learn better representations.
We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, consisting of a Masked Quantization VAE (MQ-VAE) and a Stackformer model, that avoids modeling redundant image regions.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
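The retrieval step could look roughly like the sketch below, which returns the captions of the k most visually similar entries in an external memory; how the memory is built and how the retrieved text feeds the kNN-augmented attention layer are simplified assumptions here.

```python
# Hedged sketch: cosine-similarity kNN lookup over a memory of (embedding, caption) pairs.
import torch
import torch.nn.functional as F

def retrieve_captions(query: torch.Tensor, memory_keys: torch.Tensor,
                      memory_captions: list, k: int = 5) -> list:
    """query: [D] image embedding; memory_keys: [M, D] embeddings of the external corpus."""
    q = F.normalize(query, dim=-1)
    keys = F.normalize(memory_keys, dim=-1)
    sims = keys @ q                                # cosine similarities, shape [M]
    topk = torch.topk(sims, k=k).indices
    return [memory_captions[i] for i in topk.tolist()]
```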
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- Learning to Resize Images for Computer Vision Tasks [15.381549764216134]
We show that the typical linear resizer can be replaced with learned resizers that can substantially improve performance.
Our learned image resizer is jointly trained with a baseline vision model.
We show that the proposed resizer can also be useful for fine-tuning image classification baselines for other vision tasks.
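A minimal sketch of a learned resizer trained jointly with a downstream classifier is shown below; the residual-refinement architecture is an illustrative stand-in, not the resizer proposed in the paper.

```python
# Hedged sketch: replace a fixed bilinear resize with a small trainable module
# whose gradients come from the downstream classification loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedResizer(nn.Module):
    def __init__(self, out_size: int = 224):
        super().__init__()
        self.out_size = out_size
        self.refine = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # bilinear downscale plus a learned residual correction
        base = F.interpolate(x, size=(self.out_size, self.out_size),
                             mode="bilinear", align_corners=False)
        return base + self.refine(base)

# Joint training (hypothetical classifier): the classification loss also updates the resizer.
# logits = classifier(LearnedResizer()(high_res_images))
# loss = F.cross_entropy(logits, labels); loss.backward()
```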
arXiv Detail & Related papers (2021-03-17T23:43:44Z)
- Two-shot Spatially-varying BRDF and Shape Estimation [89.29020624201708]
We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
We create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials.
Experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images.
arXiv Detail & Related papers (2020-04-01T12:56:13Z)
- Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach [104.02201472370801]
We develop a novel image coding framework by leveraging both compressive and generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
arXiv Detail & Related papers (2020-01-09T10:37:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.