Learning the meanings of function words from grounded language using a visual question answering model
- URL: http://arxiv.org/abs/2308.08628v3
- Date: Mon, 22 Apr 2024 19:00:51 GMT
- Title: Learning the meanings of function words from grounded language using a visual question answering model
- Authors: Eva Portelance, Michael C. Frank, Dan Jurafsky
- Abstract summary: We show that recent neural-network-based visual question answering models can learn to use function words as part of answering questions about complex visual scenes.
We find that these models can learn the meanings of the logical connectives "and" and "or" without any prior knowledge of logical reasoning.
Our findings offer proof-of-concept evidence that it is possible to learn the nuanced interpretations of function words in a visually grounded context.
- Score: 28.10687343493772
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Interpreting a seemingly simple function word like "or", "behind", or "more" can require logical, numerical, and relational reasoning. How are such words learned by children? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural-network-based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learnt by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of the logical connectives "and" and "or" without any prior knowledge of logical reasoning, as well as early evidence that they are sensitive to alternative expressions when interpreting language. Finally, we show that word learning difficulty is dependent on frequency in models' input. Our findings offer proof-of-concept evidence that it is possible to learn the nuanced interpretations of function words in a visually grounded context by using non-symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning.
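The abstract describes its models only at a high level: recurrent networks trained on visually grounded language to answer questions containing function words. The sketch below is a rough, hypothetical illustration of one conventional way such a recurrent VQA model can be wired up; the layer sizes, class name, and concatenation-based fusion are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a recurrent visual question answering model (hypothetical,
# not the authors' architecture): an LSTM encodes the question, its final
# hidden state is concatenated with precomputed image features, and a linear
# layer scores candidate answers.
import torch
import torch.nn as nn

class RecurrentVQA(nn.Module):
    def __init__(self, vocab_size, num_answers,
                 embed_dim=128, hidden_dim=256, image_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # word embeddings
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(hidden_dim + image_dim, hidden_dim)
        self.classify = nn.Linear(hidden_dim, num_answers)   # answer logits

    def forward(self, question_ids, image_feats):
        # question_ids: (batch, seq_len) token indices, including function words
        # image_feats:  (batch, image_dim) features from a pretrained vision encoder
        _, (h_n, _) = self.rnn(self.embed(question_ids))
        fused = torch.relu(self.fuse(torch.cat([h_n[-1], image_feats], dim=-1)))
        return self.classify(fused)

# Toy usage: two questions of length 6 over a 1,000-word vocabulary.
model = RecurrentVQA(vocab_size=1000, num_answers=30)
logits = model(torch.randint(0, 1000, (2, 6)), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 30])
```

In a setup like this, whatever the model learns about "or", "behind", or "more" has to be encoded in the word embeddings and recurrent dynamics, since no logical or numerical machinery is built in.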
Related papers
- Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models [31.006803764376475]
Semantic and syntactic bootstrapping posit that children use their prior knowledge of one linguistic domain, say syntactic relations, to help later acquire another, such as the meanings of new words.
Here, we argue that they are instead both contingent on a more general learning strategy for language acquisition: joint learning.
Using a series of neural visually-grounded grammar induction models, we demonstrate that both syntactic and semantic bootstrapping effects are strongest when syntax and semantics are learnt simultaneously.
arXiv Detail & Related papers (2024-06-17T18:01:06Z)
- A model of early word acquisition based on realistic-scale audiovisual naming events [10.047470656294333]
We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input.
We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns from statistical regularities in raw speech and pixel-level visual input.
Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants.
arXiv Detail & Related papers (2024-06-07T21:05:59Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Why can neural language models solve next-word prediction? A mathematical perspective [53.807657273043446]
We study a class of formal languages that can be used to model real-world examples of English sentences.
Our proof highlights the different roles of the embedding layer and the fully connected component within the neural language model.
arXiv Detail & Related papers (2023-06-20T10:41:23Z)
- Quantifying the Roles of Visual, Linguistic, and Visual-Linguistic Complexity in Verb Acquisition [8.183763443800348]
We employ visual and linguistic representations of words sourced from pre-trained artificial neural networks.
We find that the representation of verbs is generally more variable and less discriminable within domain than the representation of nouns.
Visual variability is the strongest factor that internally drives verb learning, followed by visual-linguistic alignment and linguistic variability.
arXiv Detail & Related papers (2023-04-05T15:08:21Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
- Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge [8.208534667678792]
Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks.
We create visually grounded word embeddings by combining English text and images and compare them to popular text-based methods.
Our analysis shows that visually grounded embedding similarities are more predictive of human reaction times than purely text-based embeddings (see the illustrative sketch after this list).
arXiv Detail & Related papers (2022-02-21T15:13:48Z)
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand? [87.20342701232869]
We investigate the abilities of ungrounded systems to acquire meaning.
We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence.
We find that assertions enable semantic emulation if all expressions in the language are referentially transparent.
However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem.
arXiv Detail & Related papers (2021-04-22T01:00:17Z)
- What is Learned in Visually Grounded Neural Syntax Acquisition [118.6461386981381]
We consider the case study of the Visually Grounded Neural Syntax Learner.
By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance.
We find that a simple lexical signal of noun concreteness plays the main role in the model's predictions.
arXiv Detail & Related papers (2020-05-04T17:32:20Z)
- Learning word-referent mappings and concepts from raw inputs [18.681222155879656]
We present a neural network model trained from scratch via self-supervision that takes in raw images and words as inputs.
The model generalizes to novel word instances, locates referents of words in a scene, and shows a preference for mutual exclusivity.
arXiv Detail & Related papers (2020-03-12T02:18:19Z)
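For the "Seeing the advantage" entry above, the sketch below illustrates the general kind of comparison it describes: similarities from visually grounded (text plus image) word embeddings versus text-only embeddings as predictors of human reaction times. The concatenation fusion, the random vectors, and the placeholder reaction times are all illustrative assumptions, not the paper's method or data.

```python
# Illustrative sketch (assumed setup, placeholder data): compare how well
# similarities from text-only vs. visually grounded (text + image) word
# embeddings correlate with human reaction times in a semantic task.
import numpy as np

rng = np.random.default_rng(0)
n_words, text_dim, image_dim = 50, 300, 128
text_vecs = rng.normal(size=(n_words, text_dim))    # stand-in for text embeddings
image_vecs = rng.normal(size=(n_words, image_dim))  # stand-in for image features
grounded_vecs = np.hstack([text_vecs, image_vecs])  # simple concatenation fusion

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Word pairs and (placeholder) human reaction times in milliseconds.
pairs = [(i, j) for i in range(n_words) for j in range(i + 1, n_words)]
reaction_times = rng.normal(600.0, 50.0, size=len(pairs))

text_sims = [cosine(text_vecs[i], text_vecs[j]) for i, j in pairs]
grounded_sims = [cosine(grounded_vecs[i], grounded_vecs[j]) for i, j in pairs]

# Correlation between each similarity space and the reaction times; with real
# data, the grounded embeddings are reported to be the more predictive of the two.
print(np.corrcoef(text_sims, reaction_times)[0, 1])
print(np.corrcoef(grounded_sims, reaction_times)[0, 1])
```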