Assessing the alignment between infants' visual and linguistic experience using multimodal language models
- URL: http://arxiv.org/abs/2511.18824v1
- Date: Mon, 24 Nov 2025 06:58:16 GMT
- Title: Assessing the alignment between infants' visual and linguistic experience using multimodal language models
- Authors: Alvin Wei Ming Tan, Jane Yang, Tarun Sepuri, Khai Loong Aw, Robert Z. Sparks, Zi Yin, Virginia A. Marchman, Michael C. Frank, Bria Long
- Abstract summary: How aligned in time are children's visual and linguistic experiences during everyday learning? We show that idealized aligned moments for learning are relatively rare in children's everyday experiences compared to modern machine learning datasets. These findings suggest that infrequent alignment is a constraint for models describing early word learning.
- Score: 2.275358921334511
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
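As a concrete illustration of the core measurement, the sketch below scores a single frame-utterance pair with an off-the-shelf CLIP model via the Hugging Face transformers API. The checkpoint, file name, and scoring function are illustrative assumptions, not the authors' exact pipeline.
```python
# Minimal sketch: CLIP-based vision-language alignment score for one
# frame-utterance pair. Checkpoint and inputs are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(image: Image.Image, utterance: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[utterance], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

frame = Image.open("frame.jpg")  # a frame sampled from egocentric video
print(alignment_score(frame, "look at the ball"))
```
In the study's setup, scores of this kind would be computed over many frame-utterance pairs and first validated against human alignment judgments before being applied to the full corpus.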
Related papers
- Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning [18.43931715859825]
As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question. We analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child. We demonstrate that neurons in this model can recognize objects beyond the model's original vocabulary.
arXiv Detail & Related papers (2025-01-09T12:55:55Z)
- A model of early word acquisition based on realistic-scale audiovisual naming events [10.047470656294333]
We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input.
We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns from statistical regularities in raw speech and pixel-level visual input.
Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants.
arXiv Detail & Related papers (2024-06-07T21:05:59Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbation methods such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
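For illustration, here is a minimal sketch of the two perturbation types named above, typo injection and word-order shuffling; these functions are hypothetical stand-ins, not the paper's implementation.
```python
# Illustrative sketch of two visually grounded perturbations: typo
# injection and word-order shuffling. Not the paper's actual code.
import random

def inject_typo(word: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a typo."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturb(sentence: str, typo_prob: float = 0.1, seed: int = 0) -> str:
    """Randomly inject typos, then shuffle word order."""
    rng = random.Random(seed)
    words = [inject_typo(w, rng) if rng.random() < typo_prob else w
             for w in sentence.split()]
    rng.shuffle(words)
    return " ".join(words)

print(perturb("the quick brown fox jumps over the lazy dog"))
```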
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play [8.164232628099619]
We propose a computational model of visual representation learning during dyadic play.
We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition.
arXiv Detail & Related papers (2023-12-07T08:18:40Z)
- Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Word Acquisition in Neural Language Models [0.38073142980733]
We investigate how neural language models acquire individual words during training, extracting learning curves and ages of acquisition for over 600 words.
We find that the effects of concreteness, word length, and lexical class are pointedly different in children and language models.
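One common way to operationalize a word's age of acquisition in a language model is the first training step at which its surprisal crosses a threshold between chance-level and final surprisal; the midpoint rule and numbers in the sketch below are assumptions for illustration, not necessarily this paper's criterion.
```python
# Sketch: age of acquisition as the first training step where a word's
# surprisal falls below the midpoint of chance and final surprisal.
# The midpoint criterion is an assumption for illustration.
from typing import Optional, Sequence

def age_of_acquisition(steps: Sequence[int],
                       surprisals: Sequence[float],
                       chance: float) -> Optional[int]:
    threshold = (chance + surprisals[-1]) / 2.0
    for step, s in zip(steps, surprisals):
        if s <= threshold:
            return step
    return None

# Hypothetical learning curve for one word (surprisal in bits).
steps = [0, 1000, 2000, 3000, 4000]
curve = [16.0, 14.0, 9.0, 6.5, 6.0]
print(age_of_acquisition(steps, curve, chance=16.0))  # -> 2000
```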
arXiv Detail & Related papers (2021-10-05T23:26:16Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
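Schematically, the retrieval step described here amounts to nearest-neighbor matching between contextual token embeddings and a bank of image embeddings; the random tensors below are placeholders standing in for a trained vokenizer, not the paper's model.
```python
# Schematic sketch of voken retrieval: each token's contextual embedding
# is mapped to the index of its nearest image embedding. The random
# embeddings are placeholders, not a trained vokenizer.
import torch

def retrieve_vokens(token_embeds: torch.Tensor,
                    image_embeds: torch.Tensor) -> torch.Tensor:
    """Return, for each token, the index of its most similar image."""
    tok = token_embeds / token_embeds.norm(dim=-1, keepdim=True)
    img = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    return (tok @ img.T).argmax(dim=-1)  # shape: (num_tokens,)

tokens = torch.randn(7, 512)     # contextual embeddings for 7 tokens
images = torch.randn(1000, 512)  # embedding bank for 1000 images
print(retrieve_vokens(tokens, images))
```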
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- A Computational Model of Early Word Learning from the Infant's Point of View [15.443815646555125]
The present study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents.
We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch.
As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved.
arXiv Detail & Related papers (2020-06-04T12:08:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.