Self-supervised learning of video representations from a child's perspective
- URL: http://arxiv.org/abs/2402.00300v3
- Date: Wed, 16 Oct 2024 19:10:03 GMT
- Title: Self-supervised learning of video representations from a child's perspective
- Authors: A. Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, Brenden M. Lake
- Abstract summary: Children learn powerful internal models of the world around them from a few years of egocentric visual experience.
Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases?
- Score: 27.439294457852423
- Abstract: Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child's visual experience with highly generic learning algorithms, or do they require strong inductive biases? Recent advances in collecting large-scale, longitudinal, developmentally realistic video datasets and generic self-supervised learning (SSL) algorithms are allowing us to begin to tackle this nature vs. nurture question. However, existing work typically focuses on image-based SSL algorithms and visual capabilities that can be learned from static images (e.g., object recognition), thus ignoring temporal aspects of the world. To close this gap, here we train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two-year period in their early development (6-31 months). The resulting models are highly effective at facilitating the learning of action concepts from a small number of labeled examples; they have favorable data size scaling properties; and they display emergent video interpolation capabilities. Video models also learn more accurate and more robust object representations than image-based models trained with the exact same data. These results suggest that important temporal aspects of a child's internal model of the world may be learnable from their visual experience using highly generic learning algorithms and without strong inductive biases.
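The abstract leaves the SSL objective unspecified; one widely used generic video objective is masked prediction over spatiotemporal patch ("tube") embeddings. Below is a minimal, hypothetical PyTorch sketch of that idea (a BERT-style masked-prediction variant); the module sizes, names, and masking ratio are illustrative assumptions, not the paper's actual model.

```python
# Toy masked video prediction: hide most tube tokens, reconstruct them.
# All sizes/names are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMaskedVideoModel(nn.Module):
    def __init__(self, dim=256, nhead=4, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)  # predicts the hidden patch embeddings

    def forward(self, patches):
        # patches: (B, N, D) spatiotemporal patch ("tube") embeddings of a clip
        b, n, d = patches.shape
        hidden = torch.rand(b, n, device=patches.device) < self.mask_ratio
        x = torch.where(hidden.unsqueeze(-1), self.mask_token.expand(b, n, d), patches)
        pred = self.head(self.backbone(x))
        # the loss is computed only at the masked positions
        return F.mse_loss(pred[hidden], patches[hidden])

clip = torch.randn(2, 128, 256)        # 2 clips, 128 tube tokens, 256-dim each
loss = TinyMaskedVideoModel()(clip)    # one self-supervised training step
loss.backward()
```

High mask ratios are common for video because adjacent frames are highly redundant, which makes the prediction task non-trivial even when most tokens are hidden.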
Related papers
- The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences [8.952954042940368]
We release the largest developmental egocentric video dataset to date: the BabyView dataset.
This 493-hour dataset includes egocentric videos from children spanning 6 months to 5 years of age, in both longitudinal at-home contexts and a preschool environment.
We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation.
arXiv Detail & Related papers (2024-06-14T23:52:27Z)
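Evaluating transfer of a pretrained encoder is commonly done with a frozen-backbone linear probe; the sketch below illustrates that generic recipe with a stand-in encoder and a synthetic 10-class task (both assumptions, not BabyView's actual models or benchmarks).

```python
# Generic linear-probe transfer evaluation: freeze the pretrained encoder,
# train only a linear classifier on downstream labels. Encoder and task
# here are synthetic stand-ins, not the BabyView models or benchmarks.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # stand-in
for p in encoder.parameters():
    p.requires_grad = False          # the pretrained backbone stays frozen

probe = nn.Linear(128, 10)           # e.g. a 10-way object recognition task
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

images = torch.randn(32, 3, 64, 64)  # one synthetic downstream batch
labels = torch.randint(0, 10, (32,))
with torch.no_grad():
    feats = encoder(images)          # extract features without gradients
loss = F.cross_entropy(probe(feats), labels)
opt.zero_grad(); loss.backward(); opt.step()
```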
- A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text.
Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z)
- Learning high-level visual representations from a child's perspective without strong inductive biases [21.466000613898988]
We train state-of-the-art neural networks on a realistic proxy of a child's visual experience without explicit supervision.
We train both embedding models and generative models on 200 hours of headcam video from a single child.
Generative models trained with the same data successfully extrapolate simple properties of partially masked objects.
arXiv Detail & Related papers (2023-05-24T17:26:59Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- iBoot: Image-bootstrapped Self-Supervised Video Representation Learning [45.845595749486215]
Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets.
We propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework.
The proposed algorithm is shown to learn much more efficiently, in fewer epochs and with a smaller batch size.
arXiv Detail & Related papers (2022-06-16T17:42:48Z)
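For the image-to-video bootstrapping setting above: iBoot itself distills from a strong image teacher, but a related, widely used way to reuse image weights in a video model is to "inflate" a pretrained 2D patch-embedding kernel over time. The sketch below shows that inflation trick (explicitly not iBoot's exact procedure); the kernel sizes are illustrative.

```python
# "Inflating" a pretrained 2D kernel into a 3D (temporal) one: repeat the
# weights over time and rescale. A common image-to-video transfer trick,
# shown for illustration; it is not iBoot's exact distillation procedure.
import torch
import torch.nn as nn

def inflate_2d_to_3d(conv2d: nn.Conv2d, time_kernel: int = 2) -> nn.Conv3d:
    c_out, c_in, kh, kw = conv2d.weight.shape
    conv3d = nn.Conv3d(c_in, c_out, (time_kernel, kh, kw),
                       stride=(time_kernel, *conv2d.stride))
    with torch.no_grad():
        # repeat over time, divide by T so activation magnitudes are preserved
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        conv3d.bias.copy_(conv2d.bias)
    return conv3d

# e.g. turn a ViT-style 16x16 image patch embed into a 2-frame tube embed
tube_embed = inflate_2d_to_3d(nn.Conv2d(3, 768, 16, stride=16), time_kernel=2)
```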
- Embodied vision for learning object representations [4.211128681972148]
We show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments.
We argue that this effect is caused by a reduction in features extracted from the background, a neural-network bias toward large features in the image, and a greater similarity between novel and familiar background regions.
arXiv Detail & Related papers (2022-05-12T16:36:27Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
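One core ingredient of video-transcript pretraining of the kind MERLOT describes is a contrastive objective matching each frame to its co-occurring transcript snippet. Below is a hedged sketch of a symmetric InfoNCE loss over paired frame/text embeddings; the encoders producing those embeddings are omitted, and the temperature value is an assumption, not MERLOT's configuration.

```python
# Symmetric InfoNCE between frame and transcript embeddings: matched pairs
# (the diagonal of the similarity matrix) should score highest both ways.
import torch
import torch.nn.functional as F

def frame_transcript_nce(frame_emb, text_emb, temperature=0.05):
    # frame_emb, text_emb: (B, D) embeddings of temporally aligned pairs
    f = F.normalize(frame_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = f @ t.T / temperature            # (B, B) cosine similarities
    labels = torch.arange(len(f))             # pair i matches pair i
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

loss = frame_transcript_nce(torch.randn(16, 128), torch.randn(16, 128))
```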
- Curious Representation Learning for Embodied Intelligence [81.21764276106924]
Self-supervised representation learning has achieved remarkable success in recent years.
Yet to build truly intelligent agents, we must construct representation learning algorithms that can learn from environments.
We propose a framework, curious representation learning, which jointly learns a reinforcement learning policy and a visual representation model.
arXiv Detail & Related papers (2021-05-03T17:59:20Z)
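The core idea above, a policy rewarded by the representation model's own training loss, fits in a few lines. The sketch below uses deliberately tiny stand-ins (a reconstruction loss as the SSL proxy and a bandit over four fixed "viewpoints" as the policy); none of it is the paper's actual environment or architecture.

```python
# Curiosity loop: the agent's reward is the representation model's loss,
# so the (bandit) policy learns to seek observations the model finds hard.
import torch
import torch.nn as nn
import torch.nn.functional as F

repr_model = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 32))
opt = torch.optim.Adam(repr_model.parameters(), lr=1e-3)
q_values = torch.zeros(4)                       # value of each viewpoint
views = [torch.randn(32) * s for s in (0.1, 0.5, 1.0, 2.0)]  # fixed scenes

for step in range(200):
    a = int(torch.argmax(q_values + 0.5 * torch.randn(4)))  # noisy-greedy pick
    obs = views[a] + 0.05 * torch.randn(32)                 # observe scene a
    loss = F.mse_loss(repr_model(obs), obs)                 # SSL proxy loss
    opt.zero_grad(); loss.backward(); opt.step()            # train the model
    reward = loss.detach()                     # curiosity reward = model loss
    q_values[a] += 0.1 * (reward - q_values[a])             # bandit update
```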
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- A Computational Model of Early Word Learning from the Infant's Point of View [15.443815646555125]
The present study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents.
We then use a Convolutional Neural Network (CNN) to process sensory data from the infant's point of view and learn name-object associations from scratch.
As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved.
arXiv Detail & Related papers (2020-06-04T12:08:44Z)
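A minimal version of the setup described above, a CNN that maps egocentric frames to the object name heard at that moment, might look like the following sketch; the architecture, vocabulary size, and synthetic data are all illustrative assumptions.

```python
# Toy name-object association: classify each headcam frame into the word
# heard while it was seen. Architecture and data are synthetic stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_WORDS = 12                                   # toy object-name vocabulary
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, NUM_WORDS),
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)

frames = torch.randn(8, 3, 64, 64)               # a batch of headcam frames
heard = torch.randint(0, NUM_WORDS, (8,))        # word heard with each frame
loss = F.cross_entropy(cnn(frames), heard)       # associate frame with word
opt.zero_grad(); loss.backward(); opt.step()
```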
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.