Related papers: Do computer vision foundation models learn the low-level characteristics of the human visual system?

URL: http://arxiv.org/abs/2502.20256v2
Date: Tue, 11 Mar 2025 21:52:23 GMT
Title: Do computer vision foundation models learn the low-level characteristics of the human visual system?
Authors: Yancheng Cai, Fei Yin, Dounia Hammou, Rafal Mantiuk
Abstract summary: Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. The question we address is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system.
Score: 12.938875245555952
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP) share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show lower sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks start to align with low-level human vision, with DINOv2 showing the closest resemblance.

Related papers

When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape. We collect 35K trials of behavioral data from over 500 participants. We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
How Well Do Deep Learning Models Capture Human Concepts? The Case of the Typicality Effect [2.3622884172290255]
Recent research looking for human-like typicality effects in language and vision models has focused on models of a single modality. This study expands this behavioral evaluation of models by considering a broader range of language and vision models. It also evaluates whether the combined typicality predictions of vision + language model pairs, as well as a multimodal CLIP-based model, are better aligned with human typicality judgments than those of models of either modality alone.
arXiv Detail & Related papers (2024-05-25T08:38:30Z)
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision. The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on 18,973 images. We design a GPT-participated conversion to process these feedbacks into diverse-format 200K instruction-response pairs.
arXiv Detail & Related papers (2023-11-12T09:10:51Z)
Divergences in Color Perception between Deep Neural Networks and Humans [3.0315685825606633]
We develop experiments for evaluating the perceptual coherence of color embeddings in deep neural networks (DNNs). We assess how well these algorithms predict human color similarity judgments collected via an online survey. We compare DNN performance against an interpretable and cognitively plausible model of color perception based on wavelet decomposition.
arXiv Detail & Related papers (2023-09-11T20:26:40Z)
ColorSense: A Study on Color Vision in Machine Visual Recognition [57.916512479603064]
We collect 110,000 non-trivial human annotations of foreground and background color labels from visual recognition benchmarks. We validate the use of our datasets by demonstrating that the level of color discrimination has a dominating effect on the performance of machine perception models. Our findings suggest that object recognition tasks such as classification and localization are susceptible to color vision bias.
arXiv Detail & Related papers (2022-12-16T18:51:41Z)
Human alignment of neural network representations [28.32452075196472]
We investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses. We find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not.
arXiv Detail & Related papers (2022-11-02T15:23:16Z)
A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings. We introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans. The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain. This model features retina-like input layers and includes two streams: one determining the next point of focus (the fixation), while the other interprets the visuals surrounding the fixation. We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
arXiv Detail & Related papers (2022-06-15T03:44:42Z)
UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks. We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
Do DNNs trained on Natural Images acquire Gestalt Properties? [0.6091702876917281]
Deep Neural Networks (DNNs) trained on natural images have been proposed as compelling models of human vision. We compared human and DNN responses in discrimination judgments. We found that networks trained on natural images exhibited sensitivity to shapes at the last stage of classification.
arXiv Detail & Related papers (2022-03-14T17:06:11Z)
Neural Re-Rendering of Humans from a Single Image [80.53438609047896]
We propose a new method for neural re-rendering of a human under a novel user-defined pose and viewpoint. Our algorithm represents body pose and shape as a parametric mesh which can be reconstructed from a single image.
arXiv Detail & Related papers (2021-01-11T18:53:47Z)
A Psychophysically Oriented Saliency Map Prediction Model [4.884688557957589]
We propose a new psychophysical saliency prediction architecture, WECSF, inspired by the multi-channel model of visual cortex functioning in humans. The proposed model is evaluated using several datasets, including the MIT1003, MIT300, Toronto, SID4VAM, and UCF Sports datasets. Our model achieved stable and better performance across different metrics on natural images, psychophysical synthetic images, and dynamic videos.
arXiv Detail & Related papers (2020-11-08T20:58:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.