Related papers: The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences

The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences

URL: http://arxiv.org/abs/2406.10447v1
Date: Fri, 14 Jun 2024 23:52:27 GMT
Title: The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences
Authors: Bria Long, Violet Xiang, Stefan Stojanov, Robert Z. Sparks, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman, Daniel L. K. Yamins, Michael C. Frank,
Abstract summary: We release the largest developmental egocentric video dataset to date -- the BabyView dataset. This 493 hour dataset includes egocentric videos from children spanning 6 months - 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation.
Score: 8.952954042940368
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This ''data gap'' is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience -- their ''training data'' -- is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date -- the BabyView dataset -- recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493 hour dataset includes egocentric videos from children spanning 6 months - 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human-levels of success on the same scale and distribution of training data as humans?

Related papers

On the robustness of modeling grounded word learning through a child's egocentric input [9.62675241698235]
We show that multimodal neural networks trained on automatically transcribed data from each child can acquire and generalize word-referent mappings across multiple network architectures.<n>Results validate the robustness of multimodal neural networks for grounded word learning.
arXiv Detail & Related papers (2025-07-19T20:55:37Z)
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning [47.451445173060094]
Human infants rapidly develop visual reasoning skills from minimal input. Recent efforts have leveraged infant-inspired datasets like SAYCam. We propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset.
arXiv Detail & Related papers (2025-04-13T04:17:12Z)
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z)
What Are You Doing? A Closer Look at Controllable Human Video Generation [73.89117620413724]
What Are You Doing?' is a new benchmark for evaluation of controllable image-to-video generation of humans. It consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. We perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation.
arXiv Detail & Related papers (2025-03-06T17:59:29Z)
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech. We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets. These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning [3.649801602551928]
We develop a set of neuroscience-inspired self-supervised objectives, together with a neural architecture, for representation learning from heterogeneous recordings. Results show that representations learned with these objectives scale with data, generalise across subjects, datasets, and tasks, and surpass comparable self-supervised approaches.
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection. Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
arXiv Detail & Related papers (2024-03-13T16:05:18Z)
Self-supervised learning of video representations from a child's perspective [27.439294457852423]
Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases?
arXiv Detail & Related papers (2024-02-01T03:27:26Z)
Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model. We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks. Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z)
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension. But to achieve these results, LMs must be trained in distinctly un-human-like ways. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects in "data engineering" We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures. We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z)
On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML) Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant. Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than Imagenet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z)
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named regrad to sustain the modeling of relationships among objects and grasps. Our dataset is collected in both forms of 2D images and 3D point clouds. Users are free to import their own object models for the generation of as many data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.