Learning high-level visual representations from a child's perspective
without strong inductive biases
- URL: http://arxiv.org/abs/2305.15372v2
- Date: Fri, 22 Sep 2023 17:41:47 GMT
- Title: Learning high-level visual representations from a child's perspective
without strong inductive biases
- Authors: A. Emin Orhan, Brenden M. Lake
- Abstract summary: We train state-of-the-art neural networks on a realistic proxy of a child's visual experience without explicit supervision.
We train both embedding models and generative models on 200 hours of headcam video from a single child.
Generative models trained with the same data successfully extrapolate simple properties of partially masked objects.
- Score: 21.466000613898988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Young children develop sophisticated internal models of the world based on
their visual experience. Can such models be learned from a child's visual
experience without strong inductive biases? To investigate this, we train
state-of-the-art neural networks on a realistic proxy of a child's visual
experience without any explicit supervision or domain-specific inductive
biases. Specifically, we train both embedding models and generative models on
200 hours of headcam video from a single child collected over two years and
comprehensively evaluate their performance in downstream tasks using various
reference models as yardsticks. On average, the best embedding models perform
at a respectable 70% of a high-performance ImageNet-trained model, despite
substantial differences in training data. They also learn broad semantic
categories and object localization capabilities without explicit supervision,
but they are less object-centric than models trained on all of ImageNet.
Generative models trained with the same data successfully extrapolate simple
properties of partially masked objects, like their rough outline, texture,
color, or orientation, but struggle with finer object details. We replicate our
experiments with two other children and find remarkably consistent results.
Broadly useful high-level visual representations are thus robustly learnable
from a representative sample of a child's visual experience without strong
inductive biases.
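To make the setup concrete, below is a minimal sketch (not the authors' code) of the general recipe the abstract describes: self-supervised training of an embedding model on unlabeled frames, followed by a frozen-encoder linear probe on a labeled downstream task, whose accuracy could then be compared against an ImageNet-trained reference model. The SimCLR-style contrastive loss, the tiny encoder, the random tensors standing in for headcam frames, and all hyperparameters are illustrative assumptions, not the specific models or objectives used in the paper.
```python
# Minimal, illustrative sketch (not the authors' code): self-supervised contrastive
# training on unlabeled frames, then a frozen-encoder linear probe on labeled data.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Small stand-in for the high-capacity vision backbones used in the paper."""

    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)


def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style loss: two views of the same frame embed close together,
    views of different frames embed far apart."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2N, dim)
    sim = z @ z.t() / temperature                         # scaled cosine similarity
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


encoder = TinyEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(10):                                    # toy training loop
    frames = torch.rand(16, 3, 64, 64)                    # stand-in for headcam frames
    view1 = frames + 0.05 * torch.randn_like(frames)      # crude "augmentations"
    view2 = frames + 0.05 * torch.randn_like(frames)
    loss = nt_xent(encoder(view1), encoder(view2))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Downstream evaluation: freeze the encoder, fit a linear probe on labeled examples,
# and compare its accuracy against the same probe on an ImageNet-trained reference.
probe = nn.Linear(128, 10)                                # 10 hypothetical classes
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
with torch.no_grad():
    feats = encoder(torch.rand(64, 3, 64, 64))            # frozen features
labels = torch.randint(0, 10, (64,))
for _ in range(100):
    probe_opt.zero_grad()
    F.cross_entropy(probe(feats), labels).backward()
    probe_opt.step()
```
In the paper's setting, the encoder would be a high-capacity vision model trained on frames drawn from roughly 200 hours of headcam video, and the probe's downstream accuracy is what is reported relative to the ImageNet-trained reference.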
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Self-supervised learning of video representations from a child's perspective [27.439294457852423]
Children learn powerful internal models of the world around them from a few years of egocentric visual experience.
Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases?
arXiv Detail & Related papers (2024-02-01T03:27:26Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is used to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model's performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
- Revisiting Weakly Supervised Pre-Training of Visual Perception Models [27.95816470075203]
Large-scale weakly supervised pre-training can outperform fully supervised approaches.
This paper revisits weakly-supervised pre-training of models using hashtag supervision.
Our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems.
arXiv Detail & Related papers (2022-01-20T18:55:06Z)
- Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)
- Distilling Visual Priors from Self-Supervised Learning [24.79633121345066]
Convolutional Neural Networks (CNNs) are prone to overfit small training datasets.
We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting.
arXiv Detail & Related papers (2020-08-01T13:07:18Z)
- A Computational Model of Early Word Learning from the Infant's Point of View [15.443815646555125]
The present study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents.
We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch.
As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved.
arXiv Detail & Related papers (2020-06-04T12:08:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.