Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?
- URL: http://arxiv.org/abs/2302.00667v2
- Date: Tue, 01 Oct 2024 16:29:14 GMT
- Title: Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?
- Authors: Tatsuki Kuribayashi, Timothy Baldwin
- Abstract summary: This study explores the advantage of grounded language acquisition, specifically the impact of visual information on syntactic generalization in neural language models (LMs).
Our experiments show that if the alignments between the linguistic and visual components are clear in the input, access to vision data does help with the syntactic generalization of LMs, but if not, visual input does not help.
This highlights the need for additional biases or signals, such as mutual gaze, to enhance cross-modal alignment and enable efficient syntactic generalization in multimodal LMs.
- Score: 32.9355090864485
- Abstract: Neural language models (LMs) are arguably less data-efficient than humans from a language acquisition perspective. One fundamental question is why this human-LM gap arises. This study explores the advantage of grounded language acquisition, specifically the impact of visual information -- which humans can usually rely on but LMs largely do not have access to during language acquisition -- on syntactic generalization in LMs. Our experiments, following the poverty of stimulus paradigm under two scenarios (using artificial vs. naturalistic images), demonstrate that if the alignments between the linguistic and visual components are clear in the input, access to vision data does help with the syntactic generalization of LMs, but if not, visual input does not help. This highlights the need for additional biases or signals, such as mutual gaze, to enhance cross-modal alignment and enable efficient syntactic generalization in multimodal LMs.
Related papers
- Exploring Spatial Schema Intuitions in Large Language and Vision Models [8.944921398608063]
We investigate whether large language models (LLMs) effectively capture implicit human intuitions about building blocks of language.
Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences.
This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and computations made by large language models.
arXiv Detail & Related papers (2024-02-01T19:25:50Z)
- Divergences between Language Models and Human Brains [63.405788999891335]
Recent research has hinted that brain signals can be effectively predicted using internal representations of language models (LMs).
We show that there are clear differences in how LMs and humans represent and use language.
We identify two domains that are not captured well by LMs: social/emotional intelligence and physical commonsense.
arXiv Detail & Related papers (2023-11-15T19:02:40Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Tackling Vision Language Tasks Through Learning Inner Monologues [10.795616787372625]
We propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems.
IMMO simulates inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves.
The results suggest IMMO can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models.
arXiv Detail & Related papers (2023-08-19T10:10:49Z)
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
Our contributions are two-fold: 1) an off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
- Context Limitations Make Neural Language Models More Human-Like [32.488137777336036]
We show discrepancies in context access between modern neural language models (LMs) and humans in incremental sentence processing.
Additional context limitation was needed to make LMs better simulate human reading behavior.
Our analyses also showed that human-LM gaps in memory access are associated with specific syntactic constructions.
arXiv Detail & Related papers (2022-05-23T17:01:13Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Presentation and Analysis of a Multimodal Dataset for Grounded Language Learning [32.28310581819443]
Grounded language acquisition involves learning how language-based interactions refer to the world around them.
In practice, the data used for learning tends to be cleaner, clearer, and more grammatical than actual human interactions.
We present a dataset of common household objects described by people using either spoken or written language.
arXiv Detail & Related papers (2020-07-29T17:58:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.