When Does Perceptual Alignment Benefit Vision Representations?
- URL: http://arxiv.org/abs/2410.10817v1
- Date: Mon, 14 Oct 2024 17:59:58 GMT
- Title: When Does Perceptual Alignment Benefit Vision Representations?
- Authors: Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Y. Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola
- Abstract summary: We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
- Score: 76.32336818860965
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability across diverse computer vision tasks. We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard vision benchmarks. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks, including counting, segmentation, depth estimation, instance retrieval, and retrieval-augmented generation. In addition, we find that performance is largely preserved on other tasks, including specialized out-of-distribution domains such as medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
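The alignment procedure described in the abstract, finetuning a backbone on human similarity judgments over image triplets, is typically implemented as a hinge loss on embedding similarities. Below is a minimal PyTorch sketch of that idea; the function name, margin value, and `backbone` interface are illustrative assumptions, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(backbone, ref, img_a, img_b, human_choice, margin=0.05):
    """Hinge loss that pulls the human-preferred image toward the reference.

    human_choice: [B] tensor of 0/1 labels, 1 where annotators judged
    img_a more similar to ref than img_b. All names and the margin are
    illustrative, not the paper's exact setup.
    """
    z_ref = F.normalize(backbone(ref), dim=-1)   # [B, D] unit embeddings
    z_a = F.normalize(backbone(img_a), dim=-1)
    z_b = F.normalize(backbone(img_b), dim=-1)
    sim_a = (z_ref * z_a).sum(-1)                # cosine similarity to A
    sim_b = (z_ref * z_b).sum(-1)                # cosine similarity to B
    # signed gap is positive when the model agrees with the human label
    gap = torch.where(human_choice.bool(), sim_a - sim_b, sim_b - sim_a)
    return F.relu(margin - gap).mean()
```

In practice such finetuning is often applied to a small set of adapter weights rather than the full backbone, so the pretrained representation is only lightly perturbed.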
Related papers
- Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning [0.7999703756441756]
Human capabilities in understanding visual relations are far superior to those of AI systems.
We develop a system equipped with a novel Glimpse-based Active Perception (GAP) mechanism.
The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content.
arXiv Detail & Related papers (2024-09-30T11:48:11Z)
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- For a semiotic AI: Bridging computer vision and visual semiotics for computational observation of large scale facial image archives [3.418398936676879]
This work presents FRESCO, a framework designed to explore the socio-cultural implications of images on social media platforms at scale.
FRESCO deconstructs images into numerical and categorical variables using state-of-the-art computer vision techniques.
The framework analyzes images across three levels: the plastic level, encompassing fundamental visual features like lines and colors; the figurative level, representing specific entities or concepts; and the enunciation level, which focuses particularly on constructing the point of view of the spectator and observer.
arXiv Detail & Related papers (2024-07-03T16:57:38Z)
- Dual Thinking and Perceptual Analysis of Deep Learning Models using Human Adversarial Examples [5.022336433202968]
The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ.
We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision.
Our study also addresses a major criticism of using classification models as computational models of human vision.
arXiv Detail & Related papers (2024-06-11T05:50:34Z)
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
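This blurb does not specify Cohere3D's objective; as a generic illustration of contrastive learning over a temporal sequence, the sketch below pulls together embeddings of the same instance at different timesteps with an InfoNCE loss. All names and values are assumptions, not the paper's actual algorithm.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(z_t, z_tk, temperature=0.07):
    """Generic InfoNCE over time: row i of z_t and row i of z_tk embed the
    same instance at two timesteps (positives); all other rows in the batch
    serve as negatives. Illustrative only -- not Cohere3D's exact loss."""
    z_t = F.normalize(z_t, dim=-1)
    z_tk = F.normalize(z_tk, dim=-1)
    logits = z_t @ z_tk.t() / temperature           # [B, B] similarity matrix
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)         # diagonal entries = positives
```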
arXiv Detail & Related papers (2024-02-23T19:43:01Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that shapes several human cognitive functions.
The proposed architecture ingests images and returns scanpaths: sequences of points with a high likelihood of catching viewers' attention.
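As a generic illustration of the image-to-scanpath contract described above (not this paper's domain-adaptive architecture), a toy autoregressive decoder might map a pooled image feature to a sequence of gaze points:

```python
import torch
import torch.nn as nn

class ScanpathDecoder(nn.Module):
    """Toy autoregressive decoder that emits a fixed-length scanpath
    (sequence of (x, y) gaze points) from a global image feature.
    Illustrative of the input/output contract only."""

    def __init__(self, feat_dim=512, hidden=256, steps=10):
        super().__init__()
        self.steps = steps
        self.rnn = nn.GRUCell(2, hidden)           # consumes the previous point
        self.init_h = nn.Linear(feat_dim, hidden)  # image feature -> initial state
        self.to_xy = nn.Linear(hidden, 2)

    def forward(self, image_feat):
        # image_feat: [B, feat_dim] pooled backbone feature
        h = torch.tanh(self.init_h(image_feat))
        pt = image_feat.new_zeros(image_feat.size(0), 2)  # start at origin
        path = []
        for _ in range(self.steps):
            h = self.rnn(pt, h)
            pt = torch.sigmoid(self.to_xy(h))  # normalized (x, y) in [0, 1]
            path.append(pt)
        return torch.stack(path, dim=1)        # [B, steps, 2]
```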
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain.
This model features retina-like input layers and includes two streams: one determines the next point of focus (the fixation), while the other interprets the visuals surrounding that fixation.
We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
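A retina-like, fixation-centered input of the kind described above is often approximated with multi-resolution crops around the fixation point. The sketch below is a generic illustration; all names and parameters are assumptions, not this model's actual preprocessing.

```python
import torch
import torch.nn.functional as F

def foveated_glimpse(image, fixation, patch=32, scales=(1, 2, 4)):
    """Extract multi-resolution crops centered on a fixation point and
    resize them to a common size -- a crude retina-like input.

    image:    [C, H, W] tensor
    fixation: (row, col) integer center of gaze
    """
    c, h, w = image.shape
    r, col = fixation
    glimpses = []
    for s in scales:
        half = patch * s // 2
        top = max(0, min(h - 2 * half, r - half))      # clamp crop inside image
        left = max(0, min(w - 2 * half, col - half))
        crop = image[:, top:top + 2 * half, left:left + 2 * half]
        # downsample coarser (peripheral) scales to the foveal resolution
        glimpses.append(F.interpolate(crop.unsqueeze(0), size=(patch, patch),
                                      mode="bilinear", align_corners=False))
    return torch.cat(glimpses, dim=0)  # [len(scales), C, patch, patch]
```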
arXiv Detail & Related papers (2022-06-15T03:44:42Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition.
We propose incorporating peripheral position encoding into the multi-head self-attention layers so that the network learns to partition the visual field into diverse peripheral regions from training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
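The exact form of PerViT's peripheral position encoding is not given in this blurb; one common way to make self-attention position-dependent, sketched below, is to add a learned bias over pairwise token distances to the attention logits. All names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DistanceBiasedAttention(nn.Module):
    """Single-head self-attention with a learned bias over pairwise token
    distances -- a rough stand-in for peripheral position encoding, letting
    the model weight near vs. far tokens differently. Illustrative only."""

    def __init__(self, dim, num_dist_buckets=16):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.dist_bias = nn.Embedding(num_dist_buckets, 1)
        self.num_dist_buckets = num_dist_buckets
        self.scale = dim ** -0.5

    def forward(self, x, coords):
        # x: [B, N, dim] tokens; coords: [N, 2] token grid positions
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale          # [B, N, N]
        dist = torch.cdist(coords.float(), coords.float())     # [N, N]
        buckets = dist.long().clamp(max=self.num_dist_buckets - 1)
        attn = attn + self.dist_bias(buckets).squeeze(-1)      # add [N, N] bias
        return attn.softmax(dim=-1) @ v
```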