CNNs and Transformers Perceive Hybrid Images Similar to Humans
- URL: http://arxiv.org/abs/2203.11678v1
- Date: Sat, 19 Mar 2022 21:37:07 GMT
- Title: CNNs and Transformers Perceive Hybrid Images Similar to Humans
- Authors: Ali Borji
- Abstract summary: We show that the predictions of deep learning vision models qualitatively match the human perception of hybrid images.
Our results provide further evidence in support of the hypothesis that Convolutional Neural Networks (CNNs) and Transformers are good at modeling the feedforward sweep of information in the ventral stream of visual cortex.
- Score: 47.64219291655723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hybrid images are a technique for generating images with two
interpretations that change as a function of viewing distance. They have been
used to study multiscale processing of images by the human visual system. Using
63,000 hybrid images across 10 fruit categories, here we show that the
predictions of deep learning vision models qualitatively match the human
perception of these images. Our results provide further evidence in support of
the hypothesis that Convolutional Neural Networks (CNNs) and Transformers are
good at modeling the feedforward sweep of information in the ventral stream of
visual cortex.
Code and data are available at https://github.com/aliborji/hybrid_images.git.
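The hybrid-image construction itself is easy to sketch: low-pass filter one image, high-pass filter another, and sum the two. Below is a minimal illustration in Python; the file names and the cutoff sigma are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

def hybrid_image(img_near, img_far, sigma=8.0):
    """Blend the high frequencies of img_near with the low frequencies of
    img_far: up close img_near dominates perception, and from a distance
    only the low-pass content of img_far survives."""
    near = img_near.astype(np.float64)
    far = img_far.astype(np.float64)
    low = gaussian_filter(far, sigma=(sigma, sigma, 0))           # low-pass
    high = near - gaussian_filter(near, sigma=(sigma, sigma, 0))  # high-pass residual
    return np.clip(low + high, 0, 255).astype(np.uint8)

# Hypothetical inputs: any two aligned RGB images of equal size.
near = np.array(Image.open("apple.png").convert("RGB"))
far = np.array(Image.open("orange.png").convert("RGB"))
Image.fromarray(hybrid_image(near, far)).save("hybrid.png")
```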
Related papers
- A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification [5.904095466127043]
We introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification.
Our model achieves state-of-the-art predictive performance compared to both black-box and interpretable models.
arXiv Detail & Related papers (2025-04-11T12:15:22Z)
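In broad strokes, such hybrid designs run a convolutional stem first and let a transformer encoder attend over the resulting feature map. A generic sketch of that pattern in PyTorch follows; it is not the paper's architecture, and all layer sizes are made up.

```python
import torch
import torch.nn as nn

class HybridCNNTransformer(nn.Module):
    """Generic CNN-stem + transformer-encoder classifier (illustrative)."""
    def __init__(self, num_classes=2, dim=128):
        super().__init__()
        # Convolutional stem: downsample the image into a feature map.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder: global self-attention over feature-map tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.stem(x)                       # (B, dim, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, dim)
        return self.head(self.encoder(tokens).mean(dim=1))  # pool, classify

logits = HybridCNNTransformer()(torch.randn(1, 3, 64, 64))  # (1, 2)
```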
- Guided Diffusion for the Extension of Machine Vision to Human Visual Perception [0.0]
We propose a method for extending machine vision to human visual perception using guided diffusion.
Guided diffusion acts as a bridge between machine vision and human perception, enabling transitions between them without any additional overhead.
arXiv Detail & Related papers (2025-03-23T03:04:26Z)
- Sensitive Image Classification by Vision Transformers [1.9598097298813262]
Vision transformer models leverage a self-attention mechanism to capture global interactions among contextual local elements.
In our study, we conducted a comparative analysis between various popular vision transformer models and traditional pre-trained ResNet models.
The findings demonstrated that vision transformer networks surpassed the benchmark pre-trained models, showcasing their superior classification and detection capabilities.
arXiv Detail & Related papers (2024-12-21T02:34:24Z)
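The self-attention operation the entry above leverages reduces to a few lines of linear algebra. A minimal single-head sketch in NumPy, for illustration only:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: (n_tokens, d) patch embeddings; w_*: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # all-pairs token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # each token mixes all others

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))                         # 16 patch tokens, dim 32
out = self_attention(x, *(rng.normal(size=(32, 32)) for _ in range(3)))
```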
- Inverting Transformer-based Vision Models [0.8124699127636158]
We apply a modular approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer and a Vision Transformer.
Our analysis illustrates how these properties emerge within the models, contributing to a deeper understanding of transformer-based vision models.
arXiv Detail & Related papers (2024-12-09T14:43:06Z)
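The inversion recipe the entry above describes can be sketched generically: freeze a vision model, collect intermediate activations, and train a small decoder to map them back to pixels. A hedged PyTorch illustration follows; the activations are mocked, and the decoder is a stand-in rather than the paper's inverse model.

```python
import torch
import torch.nn as nn

# Hooked activations would come from a frozen, pretrained ViT/DETR layer;
# random tensors stand in for them here.
B, T, D, S = 8, 196, 192, 14            # 14x14 token grid, embedding dim 192

decoder = nn.Sequential(                # tiny convolutional inverse model
    nn.ConvTranspose2d(D, 96, 4, stride=4), nn.ReLU(),    # 14 -> 56
    nn.ConvTranspose2d(96, 3, 4, stride=4), nn.Sigmoid(), # 56 -> 224
)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

feats = torch.randn(B, T, D)            # mocked intermediate activations
images = torch.rand(B, 3, 224, 224)     # the inputs that produced them

grid = feats.transpose(1, 2).reshape(B, D, S, S)  # tokens -> spatial grid
loss = nn.functional.mse_loss(decoder(grid), images)
loss.backward()
opt.step()                              # one reconstruction-training step
```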
- Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes [4.518404103861656]
We study the nature of variation in visuo-linguistic signals, and find that they correlate with each other.
Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation.
Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.
arXiv Detail & Related papers (2024-02-02T12:11:16Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images [7.868449549351487]
This article proposes to enhance our ability to recognise AI-generated images through computer vision.
The two sets of data pose a binary classification problem: is a given photograph real, or was it generated by AI?
This study proposes the use of a Convolutional Neural Network (CNN) to classify the images into two categories: real or fake (a generic sketch of such a classifier follows below).
arXiv Detail & Related papers (2023-03-24T16:33:06Z)
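A binary real-vs-fake CNN of the kind just described takes only a few lines in PyTorch. This is a generic sketch, not the CIFAKE authors' network; the input size assumes CIFAR-like 32x32 images.

```python
import torch
import torch.nn as nn

# Minimal CNN for 32x32 RGB inputs with two classes: real vs. AI-generated.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 2),           # logits for {real, fake}
)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(4, 3, 32, 32)            # mock batch; real code would load CIFAKE
y = torch.tensor([0, 1, 1, 0])          # 0 = real, 1 = AI-generated
loss = loss_fn(model(x), y)
loss.backward()
```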
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that underlies several human cognitive functions.
The proposed architecture ingests images and returns scanpaths: sequences of points with a high likelihood of attracting viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks [61.60177890353585]
Deep convolutional neural networks (CNNs) have been shown to provide excellent models for their functional analogue in the brain, the ventral stream of visual cortex.
Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex.
We show that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
arXiv Detail & Related papers (2022-05-27T08:06:40Z)
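The dimensionality expansion/reduction that the entry above tracks is commonly quantified with the participation ratio of the representation's covariance eigenvalues. A small sketch; the random features stand in for real layer activations.

```python
import numpy as np

def participation_ratio(acts):
    """Effective dimensionality of a (samples, units) activation matrix:
    PR = (sum of covariance eigenvalues)^2 / sum of squared eigenvalues."""
    eig = np.linalg.eigvalsh(np.cov(acts, rowvar=False))
    eig = np.clip(eig, 0, None)                # guard tiny negative eigenvalues
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
# Mock "layers": activations with progressively fewer underlying factors,
# mimicking dimensionality reduction along a processing hierarchy.
for rank in (64, 16, 4):
    acts = rng.normal(size=(500, rank)) @ rng.normal(size=(rank, 128))
    print(rank, round(participation_ratio(acts), 1))
```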
- Superpixel-based Domain-Knowledge Infusion in Computer Vision [0.7349727826230862]
Superpixels are higher-order perceptual groups of pixels in an image, often carrying much more information than raw pixels.
There is an inherent relational structure among the different superpixels of an image.
This relational information can convey some form of domain knowledge about the image, e.g. the relationship between superpixels representing the two eyes in a cat image.
arXiv Detail & Related papers (2021-05-20T01:25:42Z)
- Ensembling with Deep Generative Views [72.70801582346344]
Generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars.
arXiv Detail & Related papers (2021-04-29T17:58:35Z)
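The ensembling step in the entry above is simple: classify each synthesized view and average the predictions. A sketch with a stand-in classifier and mock views (the StyleGAN2 generation itself is omitted; perturbed copies stand in for generated views):

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(             # stand-in image classifier
    nn.Flatten(), nn.Linear(3 * 64 * 64, 2),
)

image = torch.rand(3, 64, 64)
# The paper draws views from StyleGAN2 re-syntheses of the input image;
# perturbed copies stand in for those generated views here.
views = torch.stack([image] + [image + 0.05 * torch.randn_like(image)
                               for _ in range(7)])

with torch.no_grad():
    probs = classifier(views).softmax(dim=-1)    # (8, 2) per-view predictions
prediction = probs.mean(dim=0).argmax().item()   # average views, then decide
```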
- A Psychophysically Oriented Saliency Map Prediction Model [4.884688557957589]
We propose a new psychophysical saliency prediction architecture, WECSF, inspired by the multi-channel model of visual cortex functioning in humans.
The proposed model is evaluated using several datasets, including the MIT1003, MIT300, Toronto, SID4VAM, and UCF Sports datasets.
Our model achieved consistently stable and superior performance across different metrics on natural images, psychophysical synthetic images, and dynamic videos.
arXiv Detail & Related papers (2020-11-08T20:58:05Z)
- Self-Supervised Linear Motion Deblurring [112.75317069916579]
Deep convolutional neural networks are state-of-the-art for image deblurring.
We present a differentiable reblur model for self-supervised motion deblurring.
Our experiments demonstrate that self-supervised single-image deblurring is indeed feasible.
arXiv Detail & Related papers (2020-02-10T20:15:21Z)
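The self-supervision signal in the entry above is reblur consistency: deblur the input, reblur the estimate with a linear motion kernel, and penalize the mismatch with the observed blurry image. A minimal differentiable sketch; the fixed horizontal kernel is an assumption for illustration, where in practice the blur would be estimated.

```python
import torch
import torch.nn.functional as F

def reblur_loss(sharp_estimate, blurry_input, length=9):
    """Differentiable reblur: convolve the estimated sharp image with a
    horizontal linear-motion kernel and compare with the observed input."""
    c = sharp_estimate.shape[1]
    kernel = torch.ones(c, 1, 1, length) / length     # per-channel box kernel
    pad = (length // 2, length // 2, 0, 0)            # pad width only
    reblurred = F.conv2d(F.pad(sharp_estimate, pad, mode="replicate"),
                         kernel, groups=c)
    return F.mse_loss(reblurred, blurry_input)

blurry = torch.rand(1, 3, 64, 64)                # observed blurry input
estimate = blurry.clone().requires_grad_()       # stand-in for a network output
loss = reblur_loss(estimate, blurry)
loss.backward()                                  # gradients flow through reblur
```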