Improving generalization by mimicking the human visual diet
- URL: http://arxiv.org/abs/2206.07802v2
- Date: Wed, 10 Jan 2024 15:48:39 GMT
- Title: Improving generalization by mimicking the human visual diet
- Authors: Spandan Madan, You Li, Mengmi Zhang, Hanspeter Pfister, Gabriel
Kreiman
- Abstract summary: We present a new perspective on bridging the generalization gap between biological and computer vision.
Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations.
- Score: 34.32585612888424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new perspective on bridging the generalization gap between
biological and computer vision -- mimicking the human visual diet. While
computer vision models rely on internet-scraped datasets, humans learn from
limited 3D scenes under diverse real-world transformations with objects in
natural context. Our results demonstrate that incorporating variations and
contextual cues ubiquitous in the human visual training data (visual diet)
significantly improves generalization to real-world transformations such as
lighting, viewpoint, and material changes. This improvement also extends to
generalizing from synthetic to real-world data -- all models trained with a
human-like visual diet outperform specialized architectures by large margins
when tested on natural image data. These experiments are enabled by our two key
contributions: a novel dataset capturing scene context and diverse real-world
transformations to mimic the human visual diet, and a transformer model
tailored to leverage these aspects of the human visual diet. All data and
source code can be accessed at
https://github.com/Spandan-Madan/human_visual_diet.
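For a concrete picture of the evaluation the abstract describes, the sketch below is a minimal, hypothetical example (not the authors' code): it trains a standard classifier on images captured under one condition and measures accuracy under a held-out real-world transformation (e.g., novel lighting), which is the kind of generalization the paper targets. The folder names and the ResNet backbone are assumptions for illustration; the actual dataset and the tailored transformer model are available at the GitHub link above.

```python
# Hypothetical sketch of a "train on one condition, test on a held-out
# transformation" protocol (NOT the authors' code; see the repo above for the
# actual dataset and model). Folder names below are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumed layout: images grouped by object class, rendered under a base
# training condition vs. a held-out condition (e.g., novel lighting).
train_set = datasets.ImageFolder("data/train_base_lighting", transform=preprocess)
test_set = datasets.ImageFolder("data/test_novel_lighting", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=len(train_set.classes)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):  # short schedule, illustration only
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()

# Generalization score: accuracy under the held-out transformation.
model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"accuracy under novel lighting: {correct / total:.3f}")
```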
Related papers
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? [28.654771227396807]
Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world.
Do VLMs experience the same kinds of visual illusions as humans, or do they faithfully learn to represent reality?
We build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs.
arXiv Detail & Related papers (2023-10-31T18:01:11Z)
- Extreme Image Transformations Affect Humans and Machines Differently [0.0]
Some recent artificial neural networks (ANNs) claim to model aspects of primate neural and human performance data.
We introduce a set of novel image transforms inspired by neurophysiological findings and evaluate humans and ANNs on an object recognition task.
We show that machines perform better than humans on certain transforms but struggle to perform on par with humans on others that humans find easy.
arXiv Detail & Related papers (2022-11-30T18:12:53Z)
- Human alignment of neural network representations [22.671101285994013]
We investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses.
We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses.
We find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not.
arXiv Detail & Related papers (2022-11-02T15:23:16Z)
- HSPACE: Synthetic Parametric Humans Animated in Complex Environments [67.8628917474705]
We build a large-scale photo-realistic dataset, Human-SPACE, of animated humans placed in complex indoor and outdoor environments.
We combine a hundred diverse individuals of varying ages, genders, proportions, and ethnicities with hundreds of motions and scenes to generate an initial dataset of over 1 million frames.
Assets are generated automatically, at scale, and are compatible with existing real time rendering and game engines.
arXiv Detail & Related papers (2021-12-23T22:27:55Z)
- Style and Pose Control for Image Synthesis of Humans from a Single Monocular View [78.6284090004218]
StylePoseGAN extends a non-controllable generator to accept conditioning on pose and appearance separately.
Our network can be trained in a fully supervised way with human images to disentangle pose, appearance and body parts.
StylePoseGAN achieves state-of-the-art image generation fidelity on common perceptual metrics.
arXiv Detail & Related papers (2021-02-22T18:50:47Z)
- S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations than purely visual ones.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Methodology for Building Synthetic Datasets with Virtual Humans [1.5556923898855324]
Large datasets can be used for improved, targeted training of deep neural networks.
In particular, we make use of a 3D morphable face model for the rendering of multiple 2D images across a dataset of 100 synthetic identities.
arXiv Detail & Related papers (2020-06-21T10:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.