Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
- URL: http://arxiv.org/abs/2312.02843v1
- Date: Tue, 5 Dec 2023 15:53:24 GMT
- Title: Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
- Authors: Lalit Pandey, Samantha M. W. Wood, Justin N. Wood
- Abstract summary: Vision transformers (ViTs) are top-performing models on many computer vision benchmarks.
ViTs are thought to be more data-hungry than brains, requiring more training data to reach similar levels of performance.
We directly compared the learning abilities of ViTs and animals by performing parallel controlled-rearing experiments on ViTs and newborn chicks.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Vision transformers (ViTs) are top-performing models on many computer vision
benchmarks and can accurately predict human behavior on object recognition
tasks. However, researchers question the value of using ViTs as models of
biological learning because ViTs are thought to be more data-hungry than
brains, requiring more training data to reach similar levels of
performance. To test this assumption, we directly compared the learning
abilities of ViTs and animals by performing parallel controlled-rearing
experiments on ViTs and newborn chicks. We first raised chicks in impoverished
visual environments containing a single object, then simulated the training
data available in those environments by building virtual animal chambers in a
video game engine. We recorded the first-person images acquired by agents
moving through the virtual chambers and used those images to train
self-supervised ViTs that leverage time as a teaching signal, akin to biological
visual systems. When ViTs were trained through the eyes of newborn chicks, the
ViTs solved the same view-invariant object recognition tasks as the chicks.
Thus, ViTs were not more data-hungry than newborn visual systems: both learned
view-invariant object representations in impoverished visual environments. The
flexible and generic attention-based learning mechanism in ViTs, combined with
the embodied data streams available to newborn animals, appears sufficient to
drive the development of animal-like object recognition.
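The abstract describes training self-supervised ViTs that use time as a teaching signal on first-person video from the virtual chambers, but the listing does not include the training objective. As a rough illustration only, the sketch below implements a generic temporal-contrastive loss in PyTorch, in which temporally adjacent first-person frames are treated as positive pairs and frames from other clips serve as negatives. The encoder interface, the one-frame positive window, the clip layout, and the temperature are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of "time as a teaching signal" (assumed formulation, not the
# authors' code): adjacent frames within an egocentric clip are positives,
# frames from other clips in the batch are negatives (InfoNCE-style objective).
import torch
import torch.nn.functional as F

def time_contrastive_loss(encoder, frames, temperature=0.1):
    """frames: (B, T, C, H, W) batch of clips of consecutive first-person images.

    `encoder` is any ViT-style module mapping a batch of images to embedding
    vectors of shape (N, D); its architecture is left unspecified here.
    """
    B, T, C, H, W = frames.shape
    z = encoder(frames.reshape(B * T, C, H, W))        # (B*T, D) embeddings
    z = F.normalize(z, dim=-1).reshape(B, T, -1)

    anchors = z[:, :-1].reshape(B * (T - 1), -1)       # frame t
    positives = z[:, 1:].reshape(B * (T - 1), -1)      # frame t+1 (positive pair)

    logits = anchors @ positives.t() / temperature     # pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)             # matched pairs lie on the diagonal
```

In a training loop, this loss would be computed on short clips sampled from the recorded first-person video and backpropagated through the ViT encoder; the window length and negative set are design decisions not specified in the abstract.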
Related papers
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization of the recent Hyena layer to multiple axes.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z) - A newborn embodied Turing test for view-invariant object recognition [0.0]
We present a "newborn embodied Turing Test" that allows newborn animals and machines to be raised in the same environments and tested with the same tasks.
To make this platform, we first collected controlled-rearing data from newborn chicks, then performed "digital twin" experiments in which machines were raised in virtual environments that mimicked the rearing conditions of the chicks.
arXiv Detail & Related papers (2023-06-08T22:46:31Z) - What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z) - When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z) - SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting video clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers (a minimal representation-similarity sketch follows this list).
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
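The "Do Vision Transformers See Like Convolutional Neural Networks?" entry above compares layer-wise representations between ViTs and CNNs (see the representation-similarity note in that entry). The listing does not name the similarity measure; a standard tool for this kind of comparison is linear centered kernel alignment (CKA), sketched below as a generic illustration. The metric choice, the pooling of activations, and the toy dimensions are assumptions rather than details taken from the listed papers.

```python
# Hedged sketch: linear centered kernel alignment (CKA), a common index for
# comparing layer representations across architectures with different widths.
import torch

def linear_cka(X, Y):
    """X: (n, d1) and Y: (n, d2) activation matrices for the same n inputs."""
    X = X - X.mean(dim=0, keepdim=True)            # center each feature column
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = (Y.t() @ X).norm() ** 2                # ||Y^T X||_F^2
    return cross / ((X.t() @ X).norm() * (Y.t() @ Y).norm())

# Toy usage: pooled activations from a ViT block and a CNN stage (random here).
vit_feats = torch.randn(512, 768)                  # e.g., mean-pooled ViT tokens
cnn_feats = torch.randn(512, 2048)                 # e.g., global-pooled CNN stage
print(float(linear_cka(vit_feats, cnn_feats)))     # 1.0 would mean identical up to rotation/scale
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling, it can compare layers of different dimensionality, which is what makes cross-architecture comparisons of the kind described in that entry possible.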