World-to-Words: Grounded Open Vocabulary Acquisition through Fast
Mapping in Vision-Language Models
- URL: http://arxiv.org/abs/2306.08685v1
- Date: Wed, 14 Jun 2023 18:10:05 GMT
- Title: World-to-Words: Grounded Open Vocabulary Acquisition through Fast
Mapping in Vision-Language Models
- Authors: Ziqiao Ma, Jiayi Pan, Joyce Chai
- Abstract summary: We introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning.
We propose object-oriented BERT (OctoBERT), a novel visually-grounded language model pre-trained on image-text pairs with grounding as an explicit objective.
We demonstrate that OctoBERT is a more coherent and faster grounded word learner, and that the grounding ability acquired during pre-training helps the model learn unseen words more rapidly and robustly.
- Score: 6.47452771256903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to connect language units to their referents in the physical
world, referred to as grounding, is crucial to learning and understanding
grounded meanings of words. While humans demonstrate fast mapping in new word
learning, it remains unclear whether modern vision-language models can truly
represent language with their grounded meanings and how grounding may further
bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary
Acquisition (GOVA) to examine grounding and bootstrapping in open-world
language learning. As an initial attempt, we propose object-oriented BERT
(OctoBERT), a novel visually-grounded language model pre-trained on
image-text pairs with grounding as an explicit objective. Through extensive
experiments and analysis, we demonstrate that OctoBERT is a more coherent and
faster grounded word learner, and that the grounding ability acquired during
pre-training helps the model to learn unseen words more rapidly and robustly.
Our code is available at https://github.com/sled-group/world-to-words
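The abstract describes pre-training on image-text pairs with grounding as an explicit objective alongside the usual language-modeling signal. As a rough illustration of what such a joint objective can look like, here is a minimal PyTorch sketch combining masked-token prediction with a word-to-region alignment loss; the class name, feature dimensions, label conventions, and equal loss weighting are assumptions for illustration, not the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch (assumptions, not the paper's code): a joint objective that
# combines masked language modeling with word-to-region grounding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedPretrainingHead(nn.Module):
    def __init__(self, hidden=768, vocab_size=30522, region_dim=2048):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)      # predicts masked tokens
        self.region_proj = nn.Linear(region_dim, hidden)   # maps region features into text space

    def forward(self, token_states, region_feats, mlm_labels, region_labels):
        # token_states:  (B, T, hidden) contextual word representations
        # region_feats:  (B, R, region_dim) visual region features (e.g., from a detector)
        # mlm_labels:    (B, T) masked-token targets, -100 where not masked
        # region_labels: (B, T) index of the region each grounded word refers to, -100 elsewhere
        mlm_loss = F.cross_entropy(
            self.mlm_head(token_states).flatten(0, 1), mlm_labels.flatten(),
            ignore_index=-100)
        regions = self.region_proj(region_feats)            # (B, R, hidden)
        # word-region affinity: each word scores every region in its image
        affinity = torch.einsum("bth,brh->btr", token_states, regions)
        grounding_loss = F.cross_entropy(
            affinity.flatten(0, 1), region_labels.flatten(), ignore_index=-100)
        return mlm_loss + grounding_loss                    # equal weighting is an assumption
```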
Related papers
- Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations.
LCG outperforms standard language-only models in learning efficiency.
It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
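As a rough sketch of what a lexicon-level contrastive grounding objective can look like (the exact LCG formulation may differ; the function name, symmetric InfoNCE form, and temperature are assumptions):

```python
# Sketch of a lexicon-level contrastive grounding loss (assumed form, not LCG's exact recipe):
# a word's embedding is pulled toward the visual features it co-occurs with and pushed away
# from the other images in the batch.
import torch
import torch.nn.functional as F

def lexical_contrastive_loss(word_emb, image_emb, temperature=0.07):
    # word_emb:  (N, D) embeddings of N word occurrences
    # image_emb: (N, D) visual features of the images each occurrence appears with
    w = F.normalize(word_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = w @ v.t() / temperature          # (N, N) similarity of every word to every image
    targets = torch.arange(w.size(0), device=w.device)
    # symmetric InfoNCE: match each word to its own image and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```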
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
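A minimal sketch of the general idea of a multimodal world model that predicts future text and image representations; the recurrent cell, dimensions, and training signal below are simplifying assumptions, not Dynalang's actual architecture:

```python
# Simplified sketch (not Dynalang's actual architecture): a recurrent multimodal world model
# that encodes text and image observations and predicts the next step's representations.
import torch
import torch.nn as nn

class TinyMultimodalWorldModel(nn.Module):
    def __init__(self, text_dim=256, image_dim=256, latent=512):
        super().__init__()
        self.encode = nn.Linear(text_dim + image_dim, latent)
        self.rnn = nn.GRUCell(latent, latent)               # recurrent latent dynamics
        self.predict_text = nn.Linear(latent, text_dim)     # next-step text representation
        self.predict_image = nn.Linear(latent, image_dim)   # next-step image representation

    def forward(self, text_feat, image_feat, state):
        # text_feat, image_feat: (B, dim) current observation; state: (B, latent) recurrent state
        z = torch.tanh(self.encode(torch.cat([text_feat, image_feat], dim=-1)))
        state = self.rnn(z, state)
        return self.predict_text(state), self.predict_image(state), state

# Assumed training signal: regress the predicted representations onto the next observation's features.
```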
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- Language with Vision: a Study on Grounded Word and Sentence Embeddings [6.231247903840833]
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations.
The present study proposes a computational grounding model for pre-trained word embeddings.
Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information.
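A minimal sketch of one way to ground pre-trained word embeddings by aligning them with visual features and gating the contribution of vision; the mapping, gate, and MSE alignment objective are illustrative assumptions rather than the paper's exact model:

```python
# Illustrative sketch (assumed formulation): ground pre-trained word embeddings by learning
# a mapping into visual space, then fusing the original and visually aligned vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedEmbedding(nn.Module):
    def __init__(self, text_dim=300, visual_dim=512):
        super().__init__()
        self.to_visual = nn.Linear(text_dim, visual_dim)   # textual -> visual alignment map
        self.gate = nn.Linear(text_dim + visual_dim, 1)    # balances language vs. vision

    def alignment_loss(self, word_vecs, image_vecs):
        # Train the map so words land near the visual features of their referents (assumed objective).
        return F.mse_loss(self.to_visual(word_vecs), image_vecs)

    def forward(self, word_vecs):
        visual = self.to_visual(word_vecs)
        g = torch.sigmoid(self.gate(torch.cat([word_vecs, visual], dim=-1)))
        return torch.cat([word_vecs, g * visual], dim=-1)  # grounded embedding
```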
arXiv Detail & Related papers (2022-06-17T15:04:05Z)
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
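The general recipe can be sketched as scoring each available skill by an LLM-derived usefulness term combined with an affordance estimate of whether the skill can currently succeed; the helper functions below are hypothetical placeholders, not the paper's API:

```python
# Sketch of the general recipe (details assumed): rank low-level skills by combining an LLM's
# score for each skill description with a learned affordance estimate of feasibility.
def choose_next_skill(instruction, history, skills, llm_score, affordance):
    # llm_score(instruction, history, skill_text) -> probability the LLM assigns to this skill
    #   as a useful next step toward the instruction (hypothetical helper)
    # affordance(skill_text) -> probability the skill can currently succeed, e.g. from the
    #   robot's value functions in the present state (hypothetical helper)
    scored = [(llm_score(instruction, history, s) * affordance(s), s) for s in skills]
    return max(scored)[1]  # execute the skill that is both useful and feasible
```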
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
- Word Discovery in Visually Grounded, Self-Supervised Speech Models [13.956691231452336]
We show that powerful word segmentation and clustering capability emerges within the model's self-attention heads.
Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models.
arXiv Detail & Related papers (2022-03-28T20:41:17Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
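A minimal sketch of what the transfer step can look like, assuming a standard soft-label distillation loss plus hidden-state matching; the exact losses and weighting used by VidLanKD may differ:

```python
# Sketch of the distillation step (assumed loss form): a student language model trained on text
# is fit to soft targets and representations from a multimodal teacher.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, T=2.0):
    # Soft-label distillation over the vocabulary distribution ...
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # ... plus matching of hidden representations (one of several possible transfer signals).
    rep = F.mse_loss(student_hidden, teacher_hidden)
    return kd + rep
```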
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
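A minimal sketch of how voken supervision can be attached to a language model as an auxiliary token-level classification loss; the head, voken vocabulary size, and label convention are assumptions for illustration:

```python
# Sketch of voken supervision (assumed shapes and names): alongside masked language modeling,
# each token is asked to predict the id of its assigned voken (a related image).
import torch.nn as nn
import torch.nn.functional as F

class VokenHead(nn.Module):
    def __init__(self, hidden=768, num_vokens=50000):
        super().__init__()
        self.classifier = nn.Linear(hidden, num_vokens)

    def forward(self, token_states, voken_ids):
        # token_states: (B, T, hidden); voken_ids: (B, T), -100 where no voken is assigned
        logits = self.classifier(token_states)
        return F.cross_entropy(logits.flatten(0, 1), voken_ids.flatten(), ignore_index=-100)

# Assumed total pre-training loss: loss = mlm_loss + voken_head(token_states, voken_ids)
```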
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Visual Grounding in Video for Unsupervised Word Translation [91.47607488740647]
We use visual grounding to improve unsupervised word mapping between languages.
We learn embeddings from unpaired instructional videos narrated in the native language.
We apply these methods to translate words from English to French, Korean, and Japanese.
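A minimal sketch of the final translation step under the assumption that words from both languages have been embedded into a shared, visually grounded space; the nearest-neighbour retrieval below is illustrative, not the paper's full method:

```python
# Sketch of translation by visual grounding (assumed procedure): embed words of each language
# from the videos they narrate, then translate by nearest neighbour in the shared visual space.
import torch
import torch.nn.functional as F

def translate(source_word_vecs, target_word_vecs, target_vocab):
    # source_word_vecs: (S, D) visually grounded embeddings of source-language words
    # target_word_vecs: (T, D) visually grounded embeddings of target-language words
    src = F.normalize(source_word_vecs, dim=-1)
    tgt = F.normalize(target_word_vecs, dim=-1)
    nearest = (src @ tgt.t()).argmax(dim=-1)    # closest target word for each source word
    return [target_vocab[i] for i in nearest.tolist()]
```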
arXiv Detail & Related papers (2020-03-11T02:03:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.