Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play
- URL: http://arxiv.org/abs/2312.04118v2
- Date: Wed, 17 Jan 2024 09:43:14 GMT
- Title: Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play
- Authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
- Abstract summary: We propose a computational model of visual representation learning during dyadic play.
We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition.
- Score: 8.164232628099619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Infants' ability to recognize and categorize objects develops gradually. The
second year of life is marked by both the emergence of more semantic visual
representations and a better understanding of word meaning. This suggests that
language input may play an important role in shaping visual representations.
However, even in suitable contexts for word learning like dyadic play sessions,
caregivers' utterances are sparse and ambiguous, often referring to objects that
are different from the one to which the child attends. Here, we systematically
investigate to what extent caregivers' utterances can nevertheless enhance
visual representations. For this, we propose a computational model of visual
representation learning during dyadic play. We introduce a synthetic dataset of
ego-centric images perceived by a toddler-agent that moves and rotates toy
objects in different parts of its home environment while hearing caregivers'
utterances, modeled as captions. We propose to model toddlers' learning as
simultaneously aligning representations for 1) close-in-time images and 2)
co-occurring images and utterances. We show that utterances with statistics
matching those of real caregivers give rise to representations supporting
improved category recognition. Our analysis reveals that even a small decrease
or increase in object-relevant naming frequencies can drastically impact the
learned representations. Such changes affect the attention paid to object names
within an utterance, which is required for efficient visuo-linguistic alignment.
Overall, our results support the hypothesis that caregivers' naming utterances
can improve toddlers' visual representations.
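To make the learning objective concrete, below is a minimal sketch (in PyTorch, not the authors' released code) of the two alignment terms described in the abstract: a temporal term pulling together embeddings of close-in-time images, and a visuo-linguistic term aligning images with co-occurring caregiver utterances. Both are written here as symmetric InfoNCE losses; the paper's exact loss, encoders, and weighting may differ.

```python
# Minimal sketch of the two alignment objectives described in the abstract
# (assumed InfoNCE formulation; not the authors' implementation).
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def dyadic_play_loss(img_t, img_t_next, utterance_emb, lambda_text: float = 1.0):
    """Combine temporal image-image alignment with image-utterance alignment.

    img_t, img_t_next : visual embeddings of frames seen close in time, (B, D)
    utterance_emb     : embeddings of the co-occurring caregiver utterances, (B, D)
    lambda_text       : hypothetical weighting between the two terms
    """
    temporal_loss = info_nce(img_t, img_t_next)              # 1) close-in-time images
    visuo_linguistic_loss = info_nce(img_t, utterance_emb)   # 2) co-occurring utterances
    return temporal_loss + lambda_text * visuo_linguistic_loss
```

Given batch embeddings of shape (B, D) from a visual encoder and a language encoder, `dyadic_play_loss(img_t, img_t_next, utt)` returns a single scalar that can be backpropagated through both encoders; the relative weighting of the two terms is an assumption for illustration.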
Related papers
- Active Gaze Behavior Boosts Self-Supervised Object Learning [4.612042044544857]
We study whether a bio-inspired visual learning model can harness toddlers' gaze behavior during a play session to develop view-invariant object recognition.
Our experiments demonstrate that toddlers' gaze strategy supports the learning of invariant object representations.
Overall, our work reveals how toddlers' gaze behavior supports self-supervised learning of view-invariant object recognition.
arXiv Detail & Related papers (2024-11-04T10:44:46Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- A model of early word acquisition based on realistic-scale audiovisual naming events [10.047470656294333]
We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input.
We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns from statistical regularities in raw speech and pixel-level visual input.
Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants.
arXiv Detail & Related papers (2024-06-07T21:05:59Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Predicting Word Learning in Children from the Performance of Computer Vision Systems [24.49899952381515]
We show that the age at which children acquire different categories of words is correlated with the performance of visual classification and captioning systems.
The performance of the computer vision systems is correlated with human judgments of the concreteness of words, which are in turn a predictor of children's word learning.
arXiv Detail & Related papers (2022-07-07T22:49:32Z)
- Embodied vision for learning object representations [4.211128681972148]
We show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments.
We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions.
arXiv Detail & Related papers (2022-05-12T16:36:27Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Using Diachronic Distributed Word Representations as Models of Lexical Development in Children [0.0]
We use diachronic distributed word representations to perform temporal modeling and analysis of lexical development in children.
We demonstrate the dynamics of growing lexical knowledge in children over time, as compared against a saturated level of lexical knowledge in child-directed adult speech.
arXiv Detail & Related papers (2021-05-11T14:44:05Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.