Learning word-referent mappings and concepts from raw inputs
- URL: http://arxiv.org/abs/2003.05573v1
- Date: Thu, 12 Mar 2020 02:18:19 GMT
- Title: Learning word-referent mappings and concepts from raw inputs
- Authors: Wai Keen Vong, Brenden M. Lake
- Abstract summary: We present a neural network model trained from scratch via self-supervision that takes in raw images and words as inputs.
The model generalizes to novel word instances, locates referents of words in a scene, and shows a preference for mutual exclusivity.
- Score: 18.681222155879656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How do children learn correspondences between language and the world from
noisy, ambiguous, naturalistic input? One hypothesis is via cross-situational
learning: tracking words and their possible referents across multiple
situations allows learners to disambiguate correct word-referent mappings (Yu &
Smith, 2007). However, previous models of cross-situational word learning
operate on highly simplified representations, side-stepping two important
aspects of the actual learning problem. First, how can word-referent mappings
be learned from raw inputs such as images? Second, how can these learned
mappings generalize to novel instances of a known word? In this paper, we
present a neural network model trained from scratch via self-supervision that
takes in raw images and words as inputs, and show that it can learn
word-referent mappings from fully ambiguous scenes and utterances through
cross-situational learning. In addition, the model generalizes to novel word
instances, locates referents of words in a scene, and shows a preference for
mutual exclusivity.
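As a rough illustration of the setup the abstract describes, here is a minimal sketch, assuming a toy PyTorch formulation: raw object images and word tokens are embedded into a shared space, and an ambiguous scene provides only a set-level pairing of words and objects. The encoder, dimensions, and margin loss are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch: a scene pairs several object images with an utterance (a set
# of word tokens); the model learns which word refers to which object purely
# from co-occurrence across many ambiguous scenes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordReferentMatcher(nn.Module):
    def __init__(self, vocab_size, embed_dim=64):
        super().__init__()
        # Tiny CNN for raw object crops (e.g. 3x32x32); illustrative only.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.word_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, images, words):
        # images: (n_objects, 3, H, W); words: (n_words,) token ids.
        img_z = F.normalize(self.image_encoder(images), dim=-1)
        wrd_z = F.normalize(self.word_embed(words), dim=-1)
        return wrd_z @ img_z.t()  # word-object similarity, (n_words, n_objects)

def cross_situational_loss(model, scene_images, scene_words,
                           distractor_images, margin=0.5):
    # Each word should match something in its own scene better than anything
    # in a distractor scene taken from elsewhere in the batch.
    sim_pos = model(scene_images, scene_words).max(dim=1).values
    sim_neg = model(distractor_images, scene_words).max(dim=1).values
    return F.relu(margin - sim_pos + sim_neg).mean()
```

Across many scenes, only the correct word-referent pairings co-occur consistently, so a margin objective of this kind gradually aligns their embeddings, which is the cross-situational signal the abstract refers to.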
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbations such as typos and word-order shuffling, which resonate with human cognitive patterns and allow the perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
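As a rough illustration of the perturbations mentioned above, here is a plain-Python sketch of typo injection and local word-order shuffling; the function names and parameters are assumptions, and the paper's actual pixel-level pipeline is not reproduced.

```python
import random

def inject_typos(sentence, rate=0.1, seed=None):
    """Randomly swap adjacent characters inside words to mimic typos."""
    rng = random.Random(seed)
    chars = list(sentence)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def shuffle_words(sentence, window=3, seed=None):
    """Shuffle words within small local windows (a mild word-order perturbation)."""
    rng = random.Random(seed)
    words = sentence.split()
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        rng.shuffle(chunk)
        words[start:start + window] = chunk
    return " ".join(words)

# A perturbed sentence can serve as the positive pair for the original in a
# contrastive objective; rendered as pixels, the two remain visually close.
print(shuffle_words(inject_typos("the quick brown fox jumps over the lazy dog")))
```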
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- Learning the meanings of function words from grounded language using a visual question answering model [28.10687343493772]
We show that recent neural-network based visual question answering models can learn to use function words as part of answering questions about complex visual scenes.
We find that these models can learn the meanings of the logical connectives "and" and "or" without any prior knowledge of logical reasoning.
Our findings offer proof-of-concept evidence that it is possible to learn the nuanced interpretations of function words in a visually grounded context.
arXiv Detail & Related papers (2023-08-16T18:53:39Z)
- Towards Open Vocabulary Learning: A Survey [146.90188069113213]
Deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection.
Open-vocabulary settings have recently been proposed, driven by rapid progress in vision-language pre-training.
This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2023-06-28T02:33:06Z)
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
arXiv Detail & Related papers (2023-06-20T08:27:42Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Towards a Theoretical Understanding of Word and Relation Representation [8.020742121274418]
Representing words by vectors, or embeddings, enables computational reasoning.
We focus on word embeddings learned from text corpora and knowledge graphs.
arXiv Detail & Related papers (2022-02-01T15:34:58Z)
- Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding [59.8167502322261]
We propose Word2Pix: a one-stage visual grounding network based on encoder-decoder transformer architecture.
The embedding of each word from the query sentence is treated equally, attending to visual pixels individually.
The proposed Word2Pix outperforms existing one-stage methods by a notable margin.
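A minimal sketch of word-to-pixel cross attention, assuming standard PyTorch multi-head attention in which each word embedding acts as its own query over the flattened visual feature map; dimensions and module choices are assumptions, not the Word2Pix implementation.

```python
import torch
import torch.nn as nn

class WordToPixelAttention(nn.Module):
    # Each word embedding queries the flattened pixel (feature-map) tokens.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, word_embs, pixel_feats):
        # word_embs:   (batch, n_words, dim)  -- one query per word
        # pixel_feats: (batch, H*W, dim)      -- flattened visual feature map
        attended, weights = self.attn(query=word_embs, key=pixel_feats, value=pixel_feats)
        return attended, weights  # weights: (batch, n_words, H*W)

# Example: 10 words attending over a 20x20 feature map.
layer = WordToPixelAttention()
words = torch.randn(2, 10, 256)
pixels = torch.randn(2, 400, 256)
out, attn = layer(words, pixels)
print(out.shape, attn.shape)  # (2, 10, 256) and (2, 10, 400)
```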
arXiv Detail & Related papers (2021-07-31T10:20:15Z)
- Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
arXiv Detail & Related papers (2021-06-16T15:29:11Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method that explicitly enhances conventional word embeddings with multi-aspect senses derived from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Using Holographically Compressed Embeddings in Question Answering [0.0]
This research employs holographic compression of pre-trained embeddings to represent a token, its part-of-speech, and named entity type.
The implementation, in a modified question-answering recurrent deep learning network, shows that semantic relationships are preserved and yields strong performance.
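Holographic compression is typically realized with circular-convolution binding from holographic reduced representations; the NumPy sketch below binds a token embedding with part-of-speech and named-entity vectors under assumed random role vectors. It illustrates the general technique only, not the paper's implementation.

```python
import numpy as np

def circular_convolution(a, b):
    # Binding operation of holographic reduced representations (HRR),
    # computed efficiently in the Fourier domain.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def holographic_token(token_vec, pos_vec, ner_vec, role_pos, role_ner):
    # Superpose the token embedding with its POS and NER features, each
    # bound to a fixed random "role" vector, into one fixed-size vector.
    return (token_vec
            + circular_convolution(role_pos, pos_vec)
            + circular_convolution(role_ner, ner_vec))

dim = 300
rng = np.random.default_rng(0)
role_pos, role_ner = rng.normal(0, 1 / np.sqrt(dim), (2, dim))
token, pos, ner = rng.normal(0, 1 / np.sqrt(dim), (3, dim))
compressed = holographic_token(token, pos, ner, role_pos, role_ner)
print(compressed.shape)  # (300,)
```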
arXiv Detail & Related papers (2020-07-14T18:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.