Predicting Word Learning in Children from the Performance of Computer
Vision Systems
- URL: http://arxiv.org/abs/2207.09847v3
- Date: Sat, 9 Sep 2023 08:33:37 GMT
- Title: Predicting Word Learning in Children from the Performance of Computer
Vision Systems
- Authors: Sunayana Rane, Mira L. Nencheva, Zeyu Wang, Casey Lew-Williams, Olga
Russakovsky, Thomas L. Griffiths
- Abstract summary: We show that the age at which children acquire different categories of words is correlated with the performance of visual classification and captioning systems.
The performance of the computer vision systems is correlated with human judgments of the concreteness of words, which are in turn a predictor of children's word learning.
- Score: 24.49899952381515
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: For human children as well as machine learning systems, a key challenge in
learning a word is linking the word to the visual phenomena it describes. We
explore this aspect of word learning by using the performance of computer
vision systems as a proxy for the difficulty of learning a word from visual
cues. We show that the age at which children acquire different categories of
words is correlated with the performance of visual classification and
captioning systems, over and above the expected effects of word frequency. The
performance of the computer vision systems is correlated with human judgments
of the concreteness of words, which are in turn a predictor of children's word
learning, suggesting that these models are capturing the relationship between
words and visual phenomena.
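As a rough illustration of the analysis logic (not the authors' code), the sketch below tests whether per-word vision-system performance predicts age of acquisition (AoA) over and above log word frequency, via partial correlation. The data here are simulated placeholders.

```python
# Minimal sketch: does vision-model performance predict AoA beyond frequency?
# All data below are simulated; real inputs would be per-word measurements.
import numpy as np
from scipy import stats

def residualize(y, x):
    """Return residuals of y after regressing out x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(0)
log_freq = rng.normal(size=200)            # hypothetical log word frequency
vision_acc = rng.uniform(0, 1, size=200)   # hypothetical per-word classifier accuracy
aoa = 30 - 2 * log_freq - 5 * vision_acc + rng.normal(scale=2, size=200)

# Partial correlation: relate AoA and vision accuracy with frequency removed.
r, p = stats.pearsonr(residualize(aoa, log_freq), residualize(vision_acc, log_freq))
print(f"partial r = {r:.3f}, p = {p:.2g}")
```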
Related papers
- A model of early word acquisition based on realistic-scale audiovisual naming events [10.047470656294333]
We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input.
We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns from statistical regularities in raw speech and pixel-level visual input.
Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants.
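A minimal sketch of this kind of audiovisual statistical learner, assuming a contrastive objective; the encoders and random tensors are illustrative stand-ins for the paper's raw speech and pixel inputs, not its actual model.

```python
# Toy contrastive audio-visual word learning: paired "speech" and "image"
# embeddings are pulled together so acoustic forms align with referents.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_enc = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
image_enc = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam([*audio_enc.parameters(), *image_enc.parameters()], lr=1e-3)

for step in range(200):
    audio = torch.randn(16, 40)    # stand-in for speech features per naming event
    image = torch.randn(16, 512)   # stand-in for visual features of the object seen
    a = F.normalize(audio_enc(audio), dim=-1)
    v = F.normalize(image_enc(image), dim=-1)
    logits = a @ v.T / 0.07        # similarity of every audio clip to every image
    labels = torch.arange(16)      # matching pairs lie on the diagonal
    loss = F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```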
arXiv Detail & Related papers (2024-06-07T21:05:59Z)
- Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play [8.164232628099619]
We propose a computational model of visual representation learning during dyadic play.
We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition.
arXiv Detail & Related papers (2023-12-07T08:18:40Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
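A schematic sketch of the language-bottleneck idea: the classifier sees only words describing the image. The toy captions below stand in for the output of a real image captioner.

```python
# Language bottleneck: text is the classifier's entire feature representation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

captions = ["a small dog on grass", "a striped cat on a sofa",
            "a dog catching a ball", "a cat sleeping in the sun"]
labels = ["dog", "cat", "dog", "cat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(captions)   # bag-of-words features from captions
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["a dog running in a park"])))
```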
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- MEWL: Few-shot multimodal word learning with referential uncertainty [24.94171567232573]
We introduce the MachinE Word Learning (MEWL) benchmark to assess how machines learn word meaning in grounded visual scenes.
MEWL covers humans' core cognitive toolkit in word learning: cross-situational reasoning, bootstrapping, and pragmatic learning.
By evaluating multimodal and unimodal agents against human performance, we observe a sharp divergence between human and machine word learning.
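A minimal illustration of cross-situational reasoning, one of the skills MEWL probes (this toy example is not the benchmark itself): across scenes, the referent consistent with every use of a novel word is the intersection of the candidate objects.

```python
# Each scene pairs the words heard with the objects visible at the time.
scenes = [
    ({"dax"}, {"cube", "ball"}),
    ({"dax"}, {"cube", "cone"}),
    ({"blick"}, {"ball", "cone"}),
]

candidates = {}
for words, objects in scenes:
    for w in words:
        # Keep only referents consistent with every occurrence of the word.
        candidates[w] = candidates.get(w, set(objects)) & objects

print(candidates)  # -> {'dax': {'cube'}, 'blick': {'ball', 'cone'}}
```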
arXiv Detail & Related papers (2023-06-01T09:54:31Z)
- Cross-Modal Alignment Learning of Vision-Language Conceptual Systems [24.423011687551433]
We propose methods for learning aligned vision-language conceptual systems inspired by infants' word learning mechanisms.
The proposed model learns the associations of visual objects and words online and gradually constructs cross-modal relational graph networks.
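A toy sketch of the online association idea, under my own simplifying assumptions: each (word, visible objects) observation strengthens word-object edges, gradually forming a bipartite graph between the two modalities.

```python
from collections import defaultdict

edge_weight = defaultdict(float)

def observe(word, visible_objects, lr=0.1):
    # Strengthen the word's link to each co-present object, spreading
    # credit across all candidates currently in view.
    for obj in visible_objects:
        edge_weight[(word, obj)] += lr / len(visible_objects)

observe("cup", ["cup", "table"])
observe("cup", ["cup", "spoon"])
observe("table", ["cup", "table"])

best = max((o for w, o in edge_weight if w == "cup"),
           key=lambda o: edge_weight[("cup", o)])
print(best)  # the cup object has accumulated the strongest association
```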
arXiv Detail & Related papers (2022-07-31T08:39:53Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
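A rough sketch of the enrichment step, using NLTK's WordNet interface; the prompt template is an assumption for illustration, not K-LITE's exact format.

```python
# Append a dictionary gloss to a class name before using it as a text prompt.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def enrich(class_name):
    synsets = wn.synsets(class_name)
    gloss = synsets[0].definition() if synsets else ""
    base = f"a photo of a {class_name}"
    return f"{base}, {gloss}" if gloss else base

print(enrich("marmoset"))
# e.g. "a photo of a marmoset, small soft-furred South American monkey ..."
```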
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- A Computational Model of Early Word Learning from the Infant's Point of View [15.443815646555125]
The present study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents.
We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch.
As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved.
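A bare-bones sketch of this setup, with random tensors standing in for egocentric frames and a small CNN in place of the study's model: frames are labeled with the object name heard while the object was in view.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 24),                  # 24 toy-object name classes (assumed)
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)

for step in range(50):
    frames = torch.randn(8, 3, 64, 64)  # stand-in egocentric video frames
    names = torch.randint(0, 24, (8,))  # names heard during those frames
    loss = nn.functional.cross_entropy(cnn(frames), names)
    opt.zero_grad(); loss.backward(); opt.step()
```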
arXiv Detail & Related papers (2020-06-04T12:08:44Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Scene text recognition methods perform well on images containing in-vocabulary words but generalize poorly to images with out-of-vocabulary words.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
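A generic sketch of mutual learning in the style the summary describes (not the paper's exact formulation): two recognizers train on the same batch, and each additionally matches the other's predicted distribution.

```python
import torch
import torch.nn.functional as F

model_a = torch.nn.Linear(32, 10)   # stand-ins for the two model families
model_b = torch.nn.Linear(32, 10)
opt = torch.optim.Adam([*model_a.parameters(), *model_b.parameters()], lr=1e-3)

for step in range(100):
    x = torch.randn(8, 32)
    y = torch.randint(0, 10, (8,))
    logits_a, logits_b = model_a(x), model_b(x)
    # Supervised loss plus a KL term pulling each model toward the other.
    loss = (F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
            + F.kl_div(F.log_softmax(logits_a, -1),
                       F.softmax(logits_b, -1).detach(), reduction="batchmean")
            + F.kl_div(F.log_softmax(logits_b, -1),
                       F.softmax(logits_a, -1).detach(), reduction="batchmean"))
    opt.zero_grad(); loss.backward(); opt.step()
```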
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
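A schematic sketch of the TRL idea as summarized here: alongside the ground-truth word, the caption model also learns from soft word probabilities "recommended" by an external language model. Both models below are untrained stand-ins, and the loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

vocab, hidden = 1000, 64
caption_head = torch.nn.Linear(hidden, vocab)          # caption model's word predictor
with torch.no_grad():
    teacher_logits = torch.randn(8, vocab)             # stand-in for ELM predictions

h = torch.randn(8, hidden)                             # decoder states at 8 time steps
gt_words = torch.randint(0, vocab, (8,))
student_logits = caption_head(h)

hard_loss = F.cross_entropy(student_logits, gt_words)  # ground-truth supervision
soft_loss = F.kl_div(F.log_softmax(student_logits, -1),
                     F.softmax(teacher_logits / 2.0, -1),  # temperature-softened
                     reduction="batchmean")
loss = hard_loss + 0.5 * soft_loss
loss.backward()
```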
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.