Seeing the advantage: visually grounding word embeddings to better
capture human semantic knowledge
- URL: http://arxiv.org/abs/2202.10292v1
- Date: Mon, 21 Feb 2022 15:13:48 GMT
- Title: Seeing the advantage: visually grounding word embeddings to better
capture human semantic knowledge
- Authors: Danny Merkx, Stefan L. Frank and Mirjam Ernestus
- Abstract summary: Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks.
We create visually grounded word embeddings by combining English text and images and compare them to popular text-based methods.
- Abstract summary: Our analysis shows that visually grounded embedding similarities are more predictive of human reaction times than purely text-based embeddings.
- Score: 8.208534667678792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributional semantic models capture word-level meaning that is useful in
many natural language processing tasks and have even been shown to capture
cognitive aspects of word meaning. The majority of these models are purely text
based, even though the human sensory experience is much richer. In this paper
we create visually grounded word embeddings by combining English text and
images and compare them to popular text-based methods, to see if visual
information allows our model to better capture cognitive aspects of word
meaning. Our analysis shows that visually grounded embedding similarities are
more predictive of the human reaction times in a large priming experiment than
the purely text-based embeddings. The visually grounded embeddings also
correlate well with human word similarity ratings. Importantly, in both
experiments we show that the grounded embeddings account for a unique portion
of explained variance, even when we include text-based embeddings trained on
huge corpora. This shows that visual grounding allows our model to capture
information that cannot be extracted using text as the only source of
information.
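As a rough illustration of the evaluation described in the abstract, the sketch below compares embedding similarities against human judgments: it computes cosine similarities for word pairs, correlates them with human similarity ratings, and checks whether grounded similarities explain variance beyond text-only similarities. This is not the authors' code; the embedding dictionaries, word pairs, and human data are hypothetical placeholders, and the nested least-squares regression is a simplified stand-in for the paper's analyses of priming reaction times.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_similarities(embeddings, word_pairs):
    # Cosine similarity for each (word1, word2) pair; `embeddings` is a
    # hypothetical dict mapping words to numpy vectors.
    return np.array([cosine(embeddings[w1], embeddings[w2]) for w1, w2 in word_pairs])

def rating_correlation(embeddings, word_pairs, human_ratings):
    # Spearman correlation between model similarities and human similarity
    # ratings, as in the word-similarity evaluation.
    sims = pair_similarities(embeddings, word_pairs)
    rho, p_value = spearmanr(sims, human_ratings)
    return rho, p_value

def r_squared(predictors, target):
    # R^2 of an ordinary least-squares fit with an intercept term.
    X = np.column_stack([np.ones(len(target)), predictors])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ beta
    return 1.0 - residuals.var() / target.var()

def unique_variance(text_sims, grounded_sims, human_data):
    # Increase in R^2 when grounded similarities are added to a model that
    # already contains text-based similarities: a simplified stand-in for the
    # paper's regression analyses of priming reaction times.
    r2_text = r_squared(text_sims.reshape(-1, 1), human_data)
    r2_both = r_squared(np.column_stack([text_sims, grounded_sims]), human_data)
    return r2_both - r2_text
```

Under these assumptions, a positive value from `unique_variance` would mirror the paper's finding that the grounded embeddings account for a unique portion of explained variance even alongside text-based embeddings.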
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbations, such as typos and word-order shuffling, that resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing away those from different samples.
We offer some strategies from contrastive learning that have recently been published and are focused on pretext tasks for visual representation.
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- Language with Vision: a Study on Grounded Word and Sentence Embeddings [6.231247903840833]
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations.
The present study proposes a computational grounding model for pre-trained word embeddings.
Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information.
arXiv Detail & Related papers (2022-06-17T15:04:05Z)
- Words are all you need? Capturing human sensory similarity with textual descriptors [12.191617984664683]
We explore the relation between human similarity judgments and language.
We introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general.
We show that our prediction pipeline based on text descriptors exhibits excellent performance.
arXiv Detail & Related papers (2022-06-08T18:09:19Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Efficient Multi-Modal Embeddings from Structured Data [0.0]
Multi-modal word semantics aims to enhance embeddings with perceptual input.
Visual grounding can contribute to linguistic applications as well.
The new embeddings convey information that is complementary to text-based embeddings.
arXiv Detail & Related papers (2021-10-06T08:42:09Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.