Explainable Semantic Space by Grounding Language to Vision with
Cross-Modal Contrastive Learning
- URL: http://arxiv.org/abs/2111.07180v1
- Date: Sat, 13 Nov 2021 19:54:15 GMT
- Title: Explainable Semantic Space by Grounding Language to Vision with
Cross-Modal Contrastive Learning
- Authors: Yizhen Zhang, Minkyu Choi, Kuan Han, Zhongming Liu
- Abstract summary: We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
- Score: 3.441021278275805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In natural language processing, most models try to learn semantic
representations merely from texts. The learned representations encode the
distributional semantics but fail to connect to any knowledge about the
physical world. In contrast, humans learn language by grounding concepts in
perception and action and the brain encodes grounded semantics for cognition.
Inspired by this notion and recent work in vision-language learning, we design
a two-stream model for grounding language learning in vision. The model
includes a VGG-based visual stream and a BERT-based language stream. The two
streams merge into a joint representational space. Through cross-modal
contrastive learning, the model first learns to align visual and language
representations with the MS COCO dataset. The model further learns to retrieve
visual objects with language queries through a cross-modal attention module and
to infer the visual relations between the retrieved objects through a bilinear
operator with the Visual Genome dataset. After training, the language stream of
this model is a stand-alone language model capable of embedding concepts in a
visually grounded semantic space. This semantic space manifests principal
dimensions explainable with human intuition and neurobiological knowledge. Word
embeddings in this semantic space are predictive of human-defined norms of
semantic features and are segregated into perceptually distinctive clusters.
Furthermore, the visually grounded language model also enables compositional
language understanding based on visual knowledge and multimodal image search
with queries based on images, texts, or their combinations.
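To make the described pipeline concrete, the following PyTorch-style sketch shows the general shape of such a two-stream model: a VGG visual stream and a BERT language stream projected into a joint space, a symmetric cross-modal contrastive (InfoNCE-style) loss for the MS COCO alignment stage, and a bilinear operator for scoring relations between retrieved object embeddings. This is a minimal illustration under assumed dimensions, pooling, and temperature, not the authors' released implementation, and it omits the cross-modal attention module used for object retrieval.

```python
# Minimal sketch (not the authors' code): a two-stream model whose visual and
# language embeddings are aligned in a joint space with a symmetric
# cross-modal contrastive (InfoNCE-style) loss, plus a bilinear relation scorer.
# Encoder choices, dimensions, and temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16
from transformers import BertModel


class TwoStreamGrounding(nn.Module):
    def __init__(self, joint_dim=512, num_relations=50):
        super().__init__()
        # Visual stream: VGG backbone followed by a linear projection.
        self.visual_backbone = vgg16(weights="IMAGENET1K_V1").features
        self.visual_proj = nn.Linear(512, joint_dim)
        # Language stream: BERT followed by a linear projection.
        self.language_backbone = BertModel.from_pretrained("bert-base-uncased")
        self.language_proj = nn.Linear(768, joint_dim)
        # Bilinear operator for scoring a relation between two object vectors.
        self.relation = nn.Bilinear(joint_dim, joint_dim, num_relations)

    def encode_image(self, images):
        # images: (B, 3, H, W) -> global-pooled VGG feature -> joint space
        feats = self.visual_backbone(images)            # (B, 512, h, w)
        feats = feats.mean(dim=(2, 3))                  # global average pool
        return F.normalize(self.visual_proj(feats), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.language_backbone(input_ids=input_ids,
                                     attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]               # [CLS] token
        return F.normalize(self.language_proj(cls), dim=-1)

    def relation_logits(self, obj_a, obj_b):
        # Bilinear scoring of the relation between two retrieved object embeddings.
        return self.relation(obj_a, obj_b)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text InfoNCE loss over a batch of matched pairs."""
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the paper, the contrastive alignment stage is trained on MS COCO image-caption pairs, while the relation stage is trained on object pairs from Visual Genome.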
Related papers
- Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations.
LCG outperforms standard language-only models in learning efficiency.
It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics [29.393661499333284]
We propose to "discretize" the visual representation by joint learning a codebook that imbues each visual token a semantic.
We then utilize these discretized visual semantics as self-supervised ground-truths for building our Masked Image Modeling objective.
Experiments validate the effectiveness of our approach across common vision-language benchmarks.
arXiv Detail & Related papers (2022-07-31T17:36:09Z)
- Pretraining on Interactions for Learning Grounded Affordance Representations [22.290431852705662]
We train a neural network to predict objects' trajectories in a simulated interaction.
We show that our network's latent representations differentiate between both observed and unobserved affordances.
Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.
arXiv Detail & Related papers (2022-07-05T19:19:53Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build a visual-semantics-aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts: first we apply the constructed hierarchy to the object recognition task, and then we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves speed by 39% and adaptively considers linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Language Models as Zero-shot Visual Semantic Learners [0.618778092044887]
We propose a Visual Semantic Embedding Probe (VSEP) to probe the semantic information of contextualized word embeddings.
The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner.
We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short.
arXiv Detail & Related papers (2021-07-26T08:22:55Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
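As a rough, non-authoritative illustration of the vokenization idea summarized in the entry above, the sketch below assigns each token in a sentence a "voken" by nearest-neighbor retrieval between contextual token embeddings and a bank of image embeddings. The encoder, the random stand-in image bank, and the shared-space projection are assumptions for illustration only; the actual method trains a contextual token-image matching model on image captioning data before generating vokens for large corpora.

```python
# Illustrative sketch of token-to-image "voken" assignment (not the released
# Vokenization implementation): each contextual token embedding retrieves the
# most similar image from a fixed bank, and the retrieved indices serve as
# auxiliary visual-supervision targets for language-model training.
import torch
import torch.nn.functional as F
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

def assign_vokens(sentence, image_bank):
    """image_bank: (num_images, hidden_dim) pre-computed, L2-normalized image
    embeddings assumed to live in the same space as the token embeddings
    (in the real method this alignment is learned on an image-captioning set)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_emb = text_encoder(**inputs).last_hidden_state[0]   # (T, 768)
    token_emb = F.normalize(token_emb, dim=-1)
    sims = token_emb @ image_bank.t()                             # (T, num_images)
    voken_ids = sims.argmax(dim=-1)                               # one image id per token
    return voken_ids

# Example with a random stand-in image bank (in practice, image features
# projected into the shared token-image space):
image_bank = F.normalize(torch.randn(1000, 768), dim=-1)
print(assign_vokens("a dog chases a ball in the park", image_bank))
```

The retrieved voken indices would then serve as extra prediction targets alongside the usual language-modeling objective, which is how the visually-supervised language models in that work obtain their improvements on pure-language tasks.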
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown (including all information) and is not responsible for any consequences of its use.