Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting
the Variation in Human Signals during Visuo-Linguistic Processes
- URL: http://arxiv.org/abs/2402.01352v1
- Date: Fri, 2 Feb 2024 12:11:16 GMT
- Title: Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting
the Variation in Human Signals during Visuo-Linguistic Processes
- Authors: Ece Takmaz, Sandro Pezzelle, Raquel Fernández
- Abstract summary: We study the nature of variation in visuo-linguistic signals, and find that they correlate with each other.
Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation.
Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.
- Score: 4.518404103861656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an intricate relation between the properties of an image and how
humans behave while describing the image. This behavior shows ample variation,
as manifested in human signals such as eye movements and the time at which humans start to
describe the image. Despite the value of such signals of visuo-linguistic
variation, they are virtually disregarded in the training of current pretrained
models, which motivates further investigation. Using a corpus of Dutch image
descriptions with concurrently collected eye-tracking data, we explore the
nature of the variation in visuo-linguistic signals, and find that they
correlate with each other. Given this result, we hypothesize that variation
stems partly from the properties of the images, and explore whether image
representations encoded by pretrained vision encoders can capture such
variation. Our results indicate that pretrained models do so to a
weak-to-moderate degree, suggesting that the models lack biases about what
makes a stimulus complex for humans and what leads to variations in human
outputs.
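A minimal sketch of this probing setup, assuming a hypothetical `variation_scores.csv` that pairs each image with a scalar human-variation score (e.g., variance in speech onset times across describers) and using an off-the-shelf ResNet-50 as a stand-in for the paper's encoders: frozen features are regressed onto the human signal and evaluated by rank correlation.

```python
# Minimal probing sketch: can frozen vision-encoder features predict
# per-image variation in human signals? (Hypothetical data file assumed.)
import torch
import pandas as pd
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

weights = ResNet50_Weights.IMAGENET1K_V2
encoder = torch.nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

df = pd.read_csv("variation_scores.csv")  # columns: image_path, variation (assumed)

feats = []
with torch.no_grad():
    for path in df["image_path"]:
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(encoder(x).flatten().numpy())

X_tr, X_te, y_tr, y_te = train_test_split(feats, df["variation"], random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
rho, p = spearmanr(probe.predict(X_te), y_te)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")  # weak-to-moderate per the paper
```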
Related papers
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
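A sketch of one way such zero-shot shape inferences can be scored for a vision model: embed all images in a trial and let the most cosine-similar pair be the model's "same object" choice (the data layout and decision rule here are illustrative, not the paper's benchmark code).

```python
# Sketch: zero-shot "which two images show the same object?" via
# cosine similarity of frozen image embeddings (illustrative layout).
import itertools
import torch
import torch.nn.functional as F

def choose_matching_pair(embeddings: torch.Tensor) -> tuple[int, int]:
    """Given [n, d] image embeddings for one trial, return the index pair
    with the highest cosine similarity -- the model's 'same object' guess."""
    e = F.normalize(embeddings, dim=-1)
    sims = e @ e.T
    pairs = list(itertools.combinations(range(len(e)), 2))
    return max(pairs, key=lambda ij: sims[ij[0], ij[1]].item())

# Toy trial: 3 views, where views 0 and 2 are (by construction) most alike.
trial = torch.tensor([[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]])
print(choose_matching_pair(trial))  # -> (0, 2); compare against human choices
```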
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- Evaluating Vision-Language Models on Bistable Images [34.492117496933915]
This study is the most extensive examination of vision-language models using bistable images to date.
We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation.
Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another.
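A sketch of the manipulate-and-tally protocol, with the vision-language model query stubbed out as a hypothetical `ask_vlm` function and an assumed local file `duck_rabbit.png`; the factor grid below is an illustrative subset, not the study's 116 manipulations.

```python
# Sketch: generate brightness / tint / rotation variants of a bistable
# image and tally a model's preferred reading (VLM call stubbed out).
from collections import Counter
from PIL import Image, ImageEnhance

def variants(img: Image.Image):
    for b in (0.5, 1.0, 1.5):              # brightness factors
        for t in (0.5, 1.0, 1.5):          # color/tint factors
            for deg in (0, 90, 180, 270):  # rotations
                v = ImageEnhance.Brightness(img).enhance(b)
                v = ImageEnhance.Color(v).enhance(t)
                yield v.rotate(deg)

def ask_vlm(img) -> str:
    # Hypothetical stand-in for a real vision-language model query,
    # e.g. "Is this a duck or a rabbit?" -> one of two labels.
    return "duck"

img = Image.open("duck_rabbit.png").convert("RGB")  # assumed local file
counts = Counter(ask_vlm(v) for v in variants(img))
print(counts)  # a lopsided count indicates a pronounced preference
```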
arXiv Detail & Related papers (2024-05-29T18:04:59Z)
- Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images [34.02058539403381]
We investigate whether human semantic knowledge can be incorporated into frameworks for fake image detection.
A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images.
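A minimal sketch of such a statistical comparison on one toy gaze feature (the fixation durations below are synthetic; the real analysis uses collected eye-tracking data):

```python
# Sketch: compare a simple gaze statistic (fixation duration) between
# genuine and altered images with a nonparametric test (toy data).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
fix_dur_genuine = rng.gamma(shape=2.0, scale=120.0, size=200)  # ms, synthetic
fix_dur_altered = rng.gamma(shape=2.0, scale=140.0, size=200)  # ms, synthetic

stat, p = mannwhitneyu(fix_dur_genuine, fix_dur_altered)
print(f"U = {stat:.1f}, p = {p:.3g}")
```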
arXiv Detail & Related papers (2024-03-13T19:56:30Z)
- Multi-Domain Norm-referenced Encoding Enables Data Efficient Transfer Learning of Facial Expression Recognition [62.997667081978825]
We propose a biologically-inspired mechanism for transfer learning in facial expression recognition.
Our proposed architecture provides an explanation for how the human brain might innately recognize facial expressions on varying head shapes.
Our model achieves a classification accuracy of 92.15% on the FERG dataset with extreme data efficiency.
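A sketch of the norm-referenced encoding idea: an expression is represented by its deviation from a neutral "norm" face in feature space, and a unit tuned to a direction responds in proportion to the deviation's magnitude and alignment. The vectors below are toy values; the actual model learns such directions from data.

```python
# Sketch of norm-referenced encoding: express an expression as its
# deviation from a neutral "norm" face; a unit tuned to direction d
# responds as deviation length times rectified cosine alignment.
import numpy as np

def norm_referenced_response(feat, norm_feat, direction):
    v = feat - norm_feat                       # deviation from the norm face
    mag = np.linalg.norm(v)
    if mag == 0:
        return 0.0
    d = direction / np.linalg.norm(direction)
    return mag * max(0.0, float(v @ d) / mag)  # magnitude x rectified cosine

norm_face = np.array([0.2, 0.1])               # neutral reference (toy)
happy_dir = np.array([1.0, 0.0])               # learned "happy" direction (toy)
print(norm_referenced_response(np.array([0.9, 0.15]), norm_face, happy_dir))
```

Using a separate norm per head shape while sharing the tuning directions is what makes this encoding transferable across domains.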
arXiv Detail & Related papers (2023-04-05T09:06:30Z)
- An Extended Study of Human-like Behavior under Adversarial Training [11.72025865314187]
We show that adversarial training increases the shift toward shape bias in neural networks.
We also provide a possible explanation for this phenomenon from a frequency perspective.
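A sketch of the frequency decomposition behind such an analysis, assuming a single-channel (grayscale) image: split it into low- and high-frequency components with an FFT mask, then probe a model on each part.

```python
# Sketch: split an image into low- and high-frequency components with an
# FFT mask -- the kind of decomposition used to probe whether a model
# leans on shape (low-frequency) or texture (high-frequency) cues.
import numpy as np

def frequency_split(img: np.ndarray, radius: int = 16):
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~low_mask)).real
    return low, high

img = np.random.rand(64, 64)          # stand-in for a grayscale image
low, high = frequency_split(img)
print(np.allclose(low + high, img))   # True: the split is exact
```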
arXiv Detail & Related papers (2023-03-22T15:47:16Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied to high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain.
This model features retina-like input layers and includes two streams: one determines the next point of focus (the fixation), while the other interprets the visual content surrounding the fixation.
We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
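A toy PyTorch rendering of the dual-stream loop (layer sizes and the crop scheme are illustrative, not the paper's architecture): a "what" stream classifies a patch around the current fixation, and a "where" stream proposes the next fixation, recurring over a few glimpses.

```python
# Sketch of the dual-stream idea: crop a patch around the current
# fixation, classify it ("what") and propose the next fixation ("where").
import torch
import torch.nn as nn

class DualStreamGlimpse(nn.Module):
    def __init__(self, patch=16, n_classes=10):
        super().__init__()
        self.patch = patch
        self.what = nn.Sequential(nn.Flatten(), nn.Linear(patch * patch, 64), nn.ReLU())
        self.where = nn.Linear(64, 2)          # next fixation (x, y), normalized
        self.classify = nn.Linear(64, n_classes)

    def forward(self, img, steps=3):           # img: [batch, H, W] grayscale
        b, h, w = img.shape
        fix = torch.full((b, 2), 0.5)          # start at the image centre
        logits = None
        for _ in range(steps):
            cy = (fix[:, 1] * (h - self.patch)).long().tolist()
            cx = (fix[:, 0] * (w - self.patch)).long().tolist()
            crops = torch.stack([img[i, cy[i]:cy[i] + self.patch,
                                        cx[i]:cx[i] + self.patch]
                                 for i in range(b)])
            feats = self.what(crops)
            fix = torch.sigmoid(self.where(feats))   # where-stream: next fixation
            logits = self.classify(feats)            # what-stream: current guess
        return logits, fix

model = DualStreamGlimpse()
logits, fix = model(torch.rand(2, 64, 64))
print(logits.shape, fix.shape)  # torch.Size([2, 10]) torch.Size([2, 2])
```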
arXiv Detail & Related papers (2022-06-15T03:44:42Z)
- Image-to-image Transformation with Auxiliary Condition [0.0]
We propose to introduce subject label information, e.g., the pose and type of objects, into the training of CycleGAN, leading it to learn label-wise transformation models.
We evaluate our proposed method called Label-CycleGAN, through experiments on the digit image transformation from SVHN to MNIST and the surveillance camera image transformation from simulated to real images.
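A sketch of the conditioning pattern this suggests: embed the label and append it to the generator input as extra channels. This is a common conditioning recipe, shown here on a minimal stand-in network rather than the paper's CycleGAN generator.

```python
# Sketch: label-conditioned generator via embedded label channels.
import torch
import torch.nn as nn

class LabelConditionedGenerator(nn.Module):
    def __init__(self, n_labels=10, emb=8):
        super().__init__()
        self.embed = nn.Embedding(n_labels, emb)
        self.net = nn.Sequential(                 # stand-in for a CycleGAN G
            nn.Conv2d(3 + emb, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x, label):
        b, _, h, w = x.shape
        lab = self.embed(label)[:, :, None, None].expand(b, -1, h, w)
        return self.net(torch.cat([x, lab], dim=1))

g = LabelConditionedGenerator()
fake = g(torch.rand(4, 3, 32, 32), torch.tensor([3, 1, 4, 1]))
print(fake.shape)  # torch.Size([4, 3, 32, 32])
```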
arXiv Detail & Related papers (2021-06-25T15:33:11Z)
- Ensembling with Deep Generative Views [72.70801582346344]
Generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars.
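A sketch of the ensembling step itself, with the StyleGAN2 view generator stubbed out by pixel jitter so the example stays self-contained; in the paper the views come from inverting the image and perturbing its latent code.

```python
# Sketch: average a classifier's predictions over generated "views".
import torch

def generate_views(img: torch.Tensor, k: int = 8) -> torch.Tensor:
    # Hypothetical stand-in for StyleGAN2 inversion + latent perturbation;
    # here we just jitter pixels to keep the sketch self-contained.
    return img.unsqueeze(0) + 0.05 * torch.randn(k, *img.shape)

def ensemble_predict(classifier, img: torch.Tensor) -> torch.Tensor:
    views = generate_views(img)
    probs = torch.softmax(classifier(views), dim=-1)
    return probs.mean(dim=0)  # averaging over views is the ensemble vote

classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 5))
print(ensemble_predict(classifier, torch.rand(3, 32, 32)))
```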
arXiv Detail & Related papers (2021-04-29T17:58:35Z)
- Adversarial Semantic Data Augmentation for Human Pose Estimation [96.75411357541438]
We propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity.
We also propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations.
State-of-the-art results are achieved on challenging benchmarks.
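A toy version of the pasting operation at the core of this augmentation: composite a segmented body part (an RGBA crop) onto a training image at a random position and scale. ASDA learns this configuration adversarially rather than sampling it at random, as done here for simplicity.

```python
# Sketch: paste a segmented body part onto an image (random config).
import random
from PIL import Image

def paste_part(img: Image.Image, part: Image.Image) -> Image.Image:
    scale = random.uniform(0.5, 1.5)
    part = part.resize((max(1, int(part.width * scale)),
                        max(1, int(part.height * scale))))
    x = random.randint(0, max(0, img.width - part.width))
    y = random.randint(0, max(0, img.height - part.height))
    out = img.copy()
    out.paste(part, (x, y), mask=part)   # alpha channel masks the part
    return out

img = Image.new("RGB", (128, 128), "gray")             # stand-ins for real data
part = Image.new("RGBA", (40, 20), (200, 50, 50, 255))
paste_part(img, part).save("augmented.png")
```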
arXiv Detail & Related papers (2020-08-03T07:56:04Z)
- Self-Supervised Linear Motion Deblurring [112.75317069916579]
Deep convolutional neural networks are state-of-the-art for image deblurring.
We present a differentiable reblur model for self-supervised motion deblurring.
Our experiments demonstrate that self-supervised single-image deblurring is feasible.
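A sketch of the reblur-consistency objective: the network's restored output, convolved with a differentiable linear-motion kernel, should reproduce the blurry input, so no sharp ground truth is needed. The fixed horizontal kernel and one-layer "network" below are toy stand-ins; the paper estimates the motion blur rather than fixing it.

```python
# Sketch: self-supervised deblurring via a differentiable reblur loss.
import torch
import torch.nn.functional as F

def linear_motion_kernel(length: int = 9) -> torch.Tensor:
    k = torch.zeros(length, length)
    k[length // 2, :] = 1.0 / length          # horizontal motion blur
    return k.expand(3, 1, length, length).contiguous()  # per RGB channel

def reblur(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    return F.conv2d(x, k, padding=k.shape[-1] // 2, groups=3)

deblur_net = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the real net
blurry = torch.rand(1, 3, 64, 64)

restored = deblur_net(blurry)
loss = F.mse_loss(reblur(restored, linear_motion_kernel()), blurry)
loss.backward()                                     # fully differentiable
print(float(loss))
```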
arXiv Detail & Related papers (2020-02-10T20:15:21Z)