Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
- URL: http://arxiv.org/abs/2501.05205v4
- Date: Tue, 25 Mar 2025 07:11:03 GMT
- Title: Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
- Authors: Xueyi Ke, Satoshi Tsutsui, Yayun Zhang, Bihan Wen
- Abstract summary: As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts, similar to how infants naturally learn? Our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs.
- Score: 18.43931715859825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic skills. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We perform neuron labeling to identify visual concept neurons hidden in the model's internal representations. We then demonstrate that these neurons can recognize objects beyond the model's original vocabulary. Furthermore, we compare the representations of this infant model with those of modern computer vision models such as CLIP and an ImageNet pre-trained model. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs. Our code is available at https://github.com/Kexueyi/discover_infant_vis.
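The abstract describes neuron labeling only at a high level. Below is a minimal, hypothetical sketch of one common recipe (in the spirit of CLIP-Dissect-style methods): pool each hidden unit's activations over a probe image set and assign the concept whose CLIP-scored presence correlates best. The ResNet stand-in for the infant model, the probe image paths, and the candidate concept list are all illustrative assumptions, not details from the paper.

```python
# Hypothetical neuron-labeling sketch (CLIP-Dissect-style); not the paper's code.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for the infant model's visual backbone (assumption).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).to(device).eval()
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

acts = []  # per-image, per-neuron spatially pooled activations
backbone.layer4.register_forward_hook(
    lambda m, i, o: acts.append(o.mean(dim=(2, 3)))  # (batch, n_neurons)
)
to_backbone = transforms.Compose([
    transforms.Resize((224, 224)), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

concepts = ["car", "ball", "dog", "cup"]    # candidate labels (illustrative)
probe_paths = ["probe1.jpg", "probe2.jpg"]  # hypothetical probe set

with torch.no_grad():
    text = clip_model.encode_text(clip.tokenize(concepts).to(device))
    text = text / text.norm(dim=-1, keepdim=True)
    img_feats = []
    for path in probe_paths:
        im = Image.open(path).convert("RGB")
        backbone(to_backbone(im).unsqueeze(0).to(device))  # fills `acts`
        f = clip_model.encode_image(clip_preprocess(im).unsqueeze(0).to(device))
        img_feats.append(f / f.norm(dim=-1, keepdim=True))

A = torch.cat(acts).float()                  # (n_images, n_neurons)
S = (torch.cat(img_feats) @ text.T).float()  # (n_images, n_concepts)

# Label each neuron with the concept whose per-image presence best
# correlates with that neuron's activations across the probe set.
A = (A - A.mean(0)) / (A.std(0) + 1e-8)
S = (S - S.mean(0)) / (S.std(0) + 1e-8)
labels = [concepts[j] for j in (A.T @ S).argmax(dim=1).tolist()]
```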
Related papers
- Active Gaze Behavior Boosts Self-Supervised Object Learning [4.612042044544857]
We study whether a bio-inspired visual learning model can harness toddlers' gaze behavior during a play session to develop view-invariant object recognition.
Our experiments demonstrate that toddlers' gaze strategy supports the learning of invariant object representations.
Overall, our work reveals how toddlers' gaze behavior supports self-supervised learning of view-invariant object recognition.
arXiv Detail & Related papers (2024-11-04T10:44:46Z) - Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics [8.749640179057469]
We use linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal deep neural network (DNN) models to predict human beauty ratings of naturalistic images.
We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision.
Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
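A minimal sketch of this kind of linear decoding, using cross-validated ridge regression from frozen image embeddings to ratings, appears below. The feature matrix and ratings are random placeholders; the paper's exact models, preprocessing, and evaluation protocol are not reproduced here.

```python
# Sketch of linear decoding from frozen embeddings to human ratings.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))  # e.g. SimCLR embeddings (placeholder)
ratings = rng.normal(size=200)          # human beauty ratings (placeholder)

decoder = RidgeCV(alphas=np.logspace(-3, 3, 13))
r2 = cross_val_score(decoder, features, ratings, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.3f}")
# Comparing model families (unimodal vision vs. language-aligned) would
# repeat this procedure per model on the same image set.
```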
arXiv Detail & Related papers (2024-10-31T03:37:21Z) - A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text.
Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z) - MIMo: A Multi-Modal Infant Model for Studying Cognitive Development [3.5009119465343033]
We present MIMo, an open-source infant model for studying early cognitive development through computer simulations.
MIMo perceives its surroundings via binocular vision, a vestibular system, proprioception, and touch perception through a full-body virtual skin.
arXiv Detail & Related papers (2023-12-07T14:21:31Z) - Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play [8.164232628099619]
We propose a computational model of visual representation learning during dyadic play.
We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition.
arXiv Detail & Related papers (2023-12-07T08:18:40Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
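A hypothetical sketch of the feature-to-prompt idea: a learned projection maps a vision feature to a few soft-prompt embeddings that a frozen language model continues. The dimensions, the linear translator, and the GPT-2 choice are illustrative assumptions, not the paper's architecture.

```python
# Sketch: translate a vision feature into soft-prompt embeddings for a frozen LM.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

class FeatureToPrompt(nn.Module):
    def __init__(self, feat_dim=2048, n_prefix=8, lm_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_prefix * lm_dim)
        self.n_prefix, self.lm_dim = n_prefix, lm_dim

    def forward(self, feat):  # feat: (batch, feat_dim)
        return self.proj(feat).view(-1, self.n_prefix, self.lm_dim)

translator = FeatureToPrompt()
feat = torch.randn(1, 2048)        # one vision-layer feature (placeholder)
prefix = translator(feat)          # (1, 8, 768) soft prompt
# `generate` with inputs_embeds needs a recent transformers version.
out = lm.generate(inputs_embeds=prefix,
                  attention_mask=torch.ones(prefix.shape[:2], dtype=torch.long),
                  max_new_tokens=12, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))  # gibberish until the translator is trained on captions
```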
arXiv Detail & Related papers (2023-09-04T13:59:55Z) - Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
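A minimal sketch of this kind of knowledge augmentation, assuming a simple prompt template and WordNet glosses via NLTK (the paper also draws on Wiktionary); the template and label are illustrative:

```python
# Sketch: enrich a class name with its WordNet gloss before text encoding.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def enrich(label: str) -> str:
    synsets = wn.synsets(label.replace(" ", "_"))
    gloss = synsets[0].definition() if synsets else ""
    return f"a photo of a {label}. {gloss}".strip()

# Prints the prompt plus the first WordNet gloss for the label.
print(enrich("mahogany"))
```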
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations than those learned from vision alone.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z) - A Computational Model of Early Word Learning from the Infant's Point of View [15.443815646555125]
The present study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents.
We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch.
As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved.
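A schematic sketch of learning name-object associations from scratch: a small CNN is trained to predict the object name heard alongside each egocentric frame. The architecture and random placeholder tensors are illustrative; data loading and gaze-based processing from the paper are omitted.

```python
# Sketch: associate egocentric frames with co-occurring object names.
import torch
import torch.nn as nn

n_names = 20  # vocabulary of object names (illustrative)
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, n_names),
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(8, 3, 64, 64)       # egocentric frames (placeholder)
names = torch.randint(0, n_names, (8,))  # indices of names heard at each frame
for _ in range(3):                       # a few illustrative training steps
    opt.zero_grad()
    loss = loss_fn(cnn(frames), names)
    loss.backward()
    opt.step()
```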
arXiv Detail & Related papers (2020-06-04T12:08:44Z) - A Developmental Neuro-Robotics Approach for Boosting the Recognition of Handwritten Digits [91.3755431537592]
Recent evidence shows that simulating children's embodied strategies can improve machine intelligence as well.
This article explores the application of embodied strategies to convolutional neural network models in the context of developmental neuro-robotics.
arXiv Detail & Related papers (2020-03-23T14:55:00Z)