Partial success in closing the gap between human and machine vision
- URL: http://arxiv.org/abs/2106.07411v1
- Date: Mon, 14 Jun 2021 13:23:35 GMT
- Title: Partial success in closing the gap between human and machine vision
- Authors: Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian
Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel
- Abstract summary: A few years ago, the first CNN surpassed human performance on ImageNet.
Here we ask: Are we making progress in closing the gap between human and machine vision?
We tested human observers on a broad range of out-of-distribution (OOD) datasets.
- Score: 30.78663978510427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A few years ago, the first CNN surpassed human performance on ImageNet.
However, it soon became clear that machines lack robustness on more challenging
test cases, a major obstacle towards deploying machines "in the wild" and
towards obtaining better computational models of human visual perception. Here
we ask: Are we making progress in closing the gap between human and machine
vision? To answer this question, we tested human observers on a broad range of
out-of-distribution (OOD) datasets, adding the "missing human baseline" by
recording 85,120 psychophysical trials across 90 participants. We then
investigated a range of promising machine learning developments that crucially
deviate from standard supervised CNNs along three axes: objective function
(self-supervised, adversarially trained, CLIP language-image training),
architecture (e.g. vision transformers), and dataset size (ranging from 1M to
1B). Our findings are threefold. (1.) The longstanding robustness gap between
humans and CNNs is closing, with the best models now matching or exceeding
human performance on most OOD datasets. (2.) There is still a substantial
image-level consistency gap, meaning that humans make different errors than
models. In contrast, most models systematically agree in their categorisation
errors, even substantially different ones like contrastive self-supervised vs.
standard supervised models. (3.) In many cases, human-to-model consistency
improves when training dataset size is increased by one to three orders of
magnitude. Our results give reason for cautious optimism: While there is still
much room for improvement, the behavioural difference between human and machine
vision is narrowing. In order to measure future progress, 17 OOD datasets with
image-level human behavioural data are provided as a benchmark here:
https://github.com/bethgelab/model-vs-human/
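The image-level consistency gap in finding (2.) is typically quantified with an error-consistency statistic: a Cohen's-kappa-style measure over per-image correct/incorrect decisions of two observers (e.g. a human and a model). The sketch below is a minimal, self-contained illustration of that computation; it is not the API of the model-vs-human repository, and the example arrays are invented for demonstration.

```python
import numpy as np

def error_consistency(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Cohen's-kappa-style error consistency between two observers.

    correct_a and correct_b are boolean arrays over the same trials
    (True = correct classification of that image).
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    assert correct_a.shape == correct_b.shape

    # Observed consistency: fraction of trials where both are right or both are wrong.
    c_obs = np.mean(correct_a == correct_b)

    # Expected consistency if errors were independent, given each observer's accuracy.
    p_a, p_b = correct_a.mean(), correct_b.mean()
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    # Degenerate case: agreement is (nearly) guaranteed by chance alone.
    if np.isclose(c_exp, 1.0):
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)

# Hypothetical usage: per-image correctness of a human observer and a model
# on one OOD dataset (these vectors are made up for illustration only).
human = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)
model = np.array([1, 0, 0, 1, 1, 1, 1, 0], dtype=bool)
print(f"error consistency = {error_consistency(human, model):.3f}")
```

A value near 0 means the two observers' errors overlap no more than expected by chance given their accuracies; a value near 1 means they fail on essentially the same images.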
Related papers
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- Sapiens: Foundation for Human Vision Models [14.72839332332364]
We present Sapiens, a family of models for four fundamental human-centric vision tasks.
Our models support 1K high-resolution inference and are easy to adapt for individual tasks.
We observe that self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks.
arXiv Detail & Related papers (2024-08-22T17:37:27Z)
- Cross-view and Cross-pose Completion for 3D Human Understanding [22.787947086152315]
We propose a pre-training approach based on self-supervised learning that works on human-centric data using only images.
We pre-train a model for body-centric tasks and one for hand-centric tasks.
With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks.
arXiv Detail & Related papers (2023-11-15T16:51:18Z)
- Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z)
- What's "up" with vision-language models? Investigating their struggle with spatial reasoning [76.2406963762722]
Three new corpora quantify model comprehension of basic spatial relations.
We evaluate 18 vision-language (VL) models, finding that all perform poorly.
We conclude by studying causes of this surprising behavior.
arXiv Detail & Related papers (2023-10-30T17:50:15Z)
- Human alignment of neural network representations [22.671101285994013]
We investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses.
We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses.
We find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not.
arXiv Detail & Related papers (2022-11-02T15:23:16Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and attains high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Are Convolutional Neural Networks or Transformers more like human vision? [9.83454308668432]
We show that attention-based networks can achieve higher accuracy than CNNs on vision tasks.
These results have implications both for building more human-like vision models, as well as for understanding visual object recognition in humans.
arXiv Detail & Related papers (2021-05-15T10:33:35Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask R-CNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
- On the surprising similarities between supervised and self-supervised models [29.04088957917865]
We compare self-supervised networks to supervised models and human behaviour.
Current self-supervised CNNs share four key characteristics of their supervised counterparts.
We are hopeful that future self-supervised models will behave differently from supervised ones.
arXiv Detail & Related papers (2020-10-16T13:28:13Z)
- Cascaded deep monocular 3D human pose estimation with evolutionary training data [76.3478675752847]
Deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation.
This paper proposes a novel data augmentation method that is scalable for massive amount of training data.
Our method synthesizes unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge.
arXiv Detail & Related papers (2020-06-14T03:09:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.