Language with Vision: a Study on Grounded Word and Sentence Embeddings
- URL: http://arxiv.org/abs/2206.08823v3
- Date: Tue, 31 Oct 2023 10:08:56 GMT
- Title: Language with Vision: a Study on Grounded Word and Sentence Embeddings
- Authors: Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik
P. A. Lensch, and Harald Baayen
- Abstract summary: Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations.
The present study proposes a computational grounding model for pre-trained word embeddings.
Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information.
- Score: 6.231247903840833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding language in vision is an active field of research seeking to
construct cognitively plausible word and sentence representations by
incorporating perceptual knowledge from vision into text-based representations.
Despite many attempts at language grounding, achieving an optimal equilibrium
between textual representations of language and our embodied experiences
remains an open problem. Some common concerns are the following: Is visual
grounding advantageous for abstract words, or is its effectiveness restricted
to concrete words? What is the optimal way of bridging the gap between text and
vision? To what extent is perceptual knowledge from images advantageous for
acquiring high-quality embeddings? Leveraging the current advances in machine
learning and natural language processing, the present study addresses these
questions by proposing a simple yet very effective computational grounding
model for pre-trained word embeddings. Our model effectively balances the
interplay between language and vision by aligning textual embeddings with
visual information while simultaneously preserving the distributional
statistics that characterize word usage in text corpora. By applying a learned
alignment, we are able to indirectly ground unseen words including abstract
words. A series of evaluations on a range of behavioural datasets shows that
visual grounding is beneficial not only for concrete words but also for
abstract words, lending support to the indirect theory of abstract concepts.
Moreover, our approach offers advantages for contextualized embeddings, such as
those generated by BERT, but only when trained on corpora of modest,
cognitively plausible sizes. Code and grounded embeddings for English are
available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2.
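The abstract describes a model that aligns pre-trained textual embeddings with visual information while preserving the distributional statistics of word usage, and that grounds unseen (including abstract) words through the learned alignment alone. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation (which is available at the GitHub link above): the class name GroundingAlignment, the linear alignment layers, the feature dimensions, and the loss weighting alpha are all assumptions made for illustration.

```python
# Hypothetical sketch of a grounding-by-alignment model: a learned mapping from
# the textual embedding space into a grounded space, trained to (a) align each
# word with the visual features of its paired image and (b) preserve the
# pairwise similarities of the original text-based embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroundingAlignment(nn.Module):
    def __init__(self, text_dim=300, visual_dim=2048, grounded_dim=300):
        super().__init__()
        # Learned alignment from the textual space to the grounded space.
        self.align = nn.Linear(text_dim, grounded_dim)
        # Projection of image features into the same grounded space.
        self.visual_proj = nn.Linear(visual_dim, grounded_dim)

    def forward(self, word_vecs, image_feats):
        return self.align(word_vecs), self.visual_proj(image_feats)


def grounding_loss(g_text, g_img, word_vecs, alpha=0.5):
    # (a) pull each grounded word towards its paired image features
    vision_term = 1.0 - F.cosine_similarity(g_text, g_img, dim=-1).mean()
    # (b) keep grounded pairwise similarities close to the original
    #     distributional similarities of the text-based embeddings
    sim_orig = F.normalize(word_vecs, dim=-1) @ F.normalize(word_vecs, dim=-1).T
    sim_grounded = F.normalize(g_text, dim=-1) @ F.normalize(g_text, dim=-1).T
    text_term = (sim_orig - sim_grounded).pow(2).mean()
    return alpha * vision_term + (1.0 - alpha) * text_term


# Usage with random stand-in data:
model = GroundingAlignment()
words = torch.randn(32, 300)    # pre-trained textual word embeddings (batch)
images = torch.randn(32, 2048)  # visual features of the paired images
g_text, g_img = model(words, images)
loss = grounding_loss(g_text, g_img, words)
```

Under this reading, once training is finished the alignment layer alone suffices to ground unseen or abstract words (model.align applied to their pre-trained embeddings), which is the indirect grounding the abstract refers to.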
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbations, such as typos and word-order shuffling, which resonate with human cognitive patterns and allow the perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model cross-modal alignment via the similarity of global image and text representations.
We introduce LOUPE, a fine-grained semantically aligned vision-language pre-training framework that learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge [8.208534667678792]
Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks.
We create visually grounded word embeddings by combining English text and images and compare them to popular text-based methods.
Our analysis shows that similarities between visually grounded embeddings are more predictive of human reaction times than those between purely text-based embeddings.
arXiv Detail & Related papers (2022-02-21T15:13:48Z)
- Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech [26.076534338576234]
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
arXiv Detail & Related papers (2021-12-27T16:12:30Z)
- Dependency Induction Through the Lens of Visual Perception [81.91502968815746]
We propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based heuristic to jointly learn constituency-structure and dependency-structure grammars.
Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
arXiv Detail & Related papers (2021-09-20T18:40:37Z)
- Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training [8.271859911016719]
Language grounding aims at linking the symbolic representation of language (e.g., words) into the rich perceptual knowledge of the outside world.
We argue that this approach sacrifices the abstract knowledge obtained from linguistic co-occurrence statistics in the process of acquiring perceptual information.
arXiv Detail & Related papers (2021-04-15T14:49:11Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images whose words appear in the training vocabulary but generalize poorly to images with out-of-vocabulary words.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy that allows models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
- Incorporating Visual Semantics into Sentence Representations within a Grounded Space [20.784771968813747]
We propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space.
We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
arXiv Detail & Related papers (2020-02-07T12:26:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.