Visually Grounded Language Learning: a review of language games,
datasets, tasks, and models
- URL: http://arxiv.org/abs/2312.02431v1
- Date: Tue, 5 Dec 2023 02:17:29 GMT
- Authors: Alessandro Suglia and Ioannis Konstas and Oliver Lemon
- Abstract summary: Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
- Score: 60.2604624857992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, several machine learning models trained with a
language modelling objective on large-scale text-only data have been proposed.
With such pretraining, they can achieve impressive results on many Natural
Language Understanding and Generation tasks. However, many facets of meaning
cannot be learned by "listening to the radio" alone. In the literature, many
Vision+Language (V+L) tasks have been defined with the aim of creating models
that can ground symbols in the visual modality. In this work, we provide a
systematic literature review of several tasks and models proposed in the V+L
field. We rely on Wittgenstein's idea of 'language games' to categorise such
tasks into three families: 1) discriminative games, 2) generative games, and
3) interactive games. Our analysis of the literature provides evidence that
future work should focus on interactive games, where communication in Natural
Language is important for resolving ambiguities about object referents and
action plans, and that physical embodiment is essential for understanding the
semantics of situations and events. Overall, these represent key requirements
for developing grounded meanings in neural models.
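To make the taxonomy concrete, here is a small illustrative sketch in Python.
The three family definitions follow the abstract above, but the specific
benchmark-to-family assignments are illustrative assumptions, not a list given
by the paper.

```python
# Illustration of the three language-game families from the abstract.
# The family definitions come from the paper; the example benchmarks
# assigned to each family below are assumptions for illustration only.
from enum import Enum


class LanguageGame(Enum):
    DISCRIMINATIVE = "select or classify among candidates given an image"
    GENERATIVE = "produce language from multimodal input"
    INTERACTIVE = "multi-turn communication, possibly embodied"


# Assumed benchmark-to-family mapping (not taken from the paper).
EXAMPLE_TASKS = {
    "image-text retrieval": LanguageGame.DISCRIMINATIVE,
    "referring expression comprehension": LanguageGame.DISCRIMINATIVE,
    "image captioning": LanguageGame.GENERATIVE,
    "visual question answering": LanguageGame.GENERATIVE,
    "visual dialogue (e.g. GuessWhat?!)": LanguageGame.INTERACTIVE,
    "embodied instruction following": LanguageGame.INTERACTIVE,
}

for task, family in EXAMPLE_TASKS.items():
    print(f"{task:40s} -> {family.name}")
```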
Related papers
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
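As a rough illustration of this idea, the sketch below reweights a text-only
LM's next-token distribution by a visual importance weight derived from a
vision-language model; the toy distributions and the `lam` mixing parameter
are assumptions, and the paper's exact VLIS formulation may differ.

```python
# Minimal sketch of importance-sampling-style reweighting of a text-only
# LM by a vision-language model (VLM), in the spirit of VLIS. The toy
# distributions and lam below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8

# Toy next-token log-probabilities (in practice produced by the models).
log_p_text = np.log(rng.dirichlet(np.ones(vocab_size)))      # text-only LM
log_p_vlm_img = np.log(rng.dirichlet(np.ones(vocab_size)))   # VLM given the image
log_p_vlm_null = np.log(rng.dirichlet(np.ones(vocab_size)))  # VLM without the image

lam = 1.0  # assumed mixing weight for the visual importance term

# PMI-style importance weight: how much the image raises each token's
# probability under the VLM.
visual_weight = log_p_vlm_img - log_p_vlm_null

# Combine: fluent text distribution reweighted by visual evidence.
combined = log_p_text + lam * visual_weight
combined -= np.logaddexp.reduce(combined)  # renormalise in log space

print("chosen token id:", int(np.argmax(combined)))
```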
arXiv Detail & Related papers (2023-10-15T07:58:52Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z) - Language Grounding with 3D Objects [60.67796160959387]
We introduce a novel reasoning task (SNARE) that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
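A toy sketch of the CLIP-based matching idea follows; the random vectors
stand in for real CLIP text and image embeddings, and averaging scores over
views is one plausible reading rather than the paper's exact model.

```python
# Sketch of CLIP-style object selection from language, assuming embeddings
# have already been produced by a CLIP-like encoder (the toy random vectors
# below stand in for those embeddings). Each candidate object is observed
# from several views, as in the SNARE-style setup described above.
import numpy as np

rng = np.random.default_rng(1)
dim, n_objects, n_views = 16, 2, 4

text_emb = rng.normal(size=dim)                         # stands in for encode(description)
view_embs = rng.normal(size=(n_objects, n_views, dim))  # image encoder output per view


def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


# Score each object by its mean similarity across views; a view-estimation
# head (as the paper adds) would additionally predict which view is shown.
scores = [np.mean([cosine(text_emb, v) for v in obj]) for obj in view_embs]
print("referred object:", int(np.argmax(scores)))
```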
arXiv Detail & Related papers (2021-07-26T23:35:58Z) - Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
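The text-to-text framing can be sketched as follows; the task prefixes are
illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of the "all tasks as text generation" framing: every task is
# serialised into a text prompt (paired with visual features at the model
# input) and the label is produced as text. The prefix strings here are
# assumptions for illustration, not the paper's exact prompts.
def build_prompt(task: str, **fields: str) -> str:
    if task == "vqa":
        return f"vqa: question: {fields['question']}"
    if task == "captioning":
        return "caption:"
    if task == "grounding":
        return f"ground: {fields['phrase']}"
    raise ValueError(f"unknown task: {task}")


print(build_prompt("vqa", question="What colour is the bus?"))
# A single encoder-decoder then generates the answer (e.g. "yellow")
# conditioned on this prompt plus visual features, so rare answers can
# still be produced rather than chosen from a fixed label set.
```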
arXiv Detail & Related papers (2021-02-04T17:59:30Z) - Multi-Sense Language Modelling [19.396806939258806]
We propose a language model which not only predicts the next word, but also its sense in context.
This higher prediction granularity may be useful for end tasks such as assistive writing.
For sense prediction, we utilise a Graph Attention Network, which encodes definitions and example uses of word senses.
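A minimal sketch of such a two-headed language model appears below; it
replaces the paper's Graph Attention Network sense encoder with a plain
linear head, and all sizes are illustrative assumptions.

```python
# Sketch of a language model with a second prediction head for word senses.
# The paper encodes sense definitions with a Graph Attention Network; this
# toy version uses a plain linear sense head instead, and all dimensions
# below are illustrative.
import torch
import torch.nn as nn

vocab_size, n_senses, d = 100, 300, 32


class MultiSenseLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.word_head = nn.Linear(d, vocab_size)  # predicts the next word
        self.sense_head = nn.Linear(d, n_senses)   # predicts its sense in context

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.word_head(h), self.sense_head(h)


model = MultiSenseLM()
word_logits, sense_logits = model(torch.randint(0, vocab_size, (1, 5)))
print(word_logits.shape, sense_logits.shape)  # (1, 5, 100) (1, 5, 300)
```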
arXiv Detail & Related papers (2020-12-10T16:06:05Z) - A Visuospatial Dataset for Naturalistic Verb Learning [18.654373173232205]
We introduce a new dataset for training and evaluating grounded language models.
Our data is collected within a virtual reality environment and is designed to emulate the quality of language data to which a pre-verbal child is likely to have access.
We use the collected data to compare several distributional semantics models for verb learning.
arXiv Detail & Related papers (2020-10-28T20:47:13Z)