Language Models as Zero-shot Visual Semantic Learners
- URL: http://arxiv.org/abs/2107.12021v1
- Date: Mon, 26 Jul 2021 08:22:55 GMT
- Title: Language Models as Zero-shot Visual Semantic Learners
- Authors: Yue Jiao, Jonathon Hare, Adam Prügel-Bennett
- Abstract summary: We propose a Visual Semantic Embedding Probe (VSEP) to probe the semantic information of contextualized word embeddings.
The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner.
We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short.
- Score: 0.618778092044887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Semantic Embedding (VSE) models, which map images into a rich semantic
embedding space, have been a milestone in object recognition and zero-shot
learning. Current approaches to VSE heavily rely on static word embedding
techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP)
designed to probe the semantic information of contextualized word embeddings in
visual semantic understanding tasks. We show that the knowledge encoded in
transformer language models can be exploited for tasks requiring visual
semantic understanding. The VSEP with contextual representations can distinguish
word-level object representations in complicated scenes as a compositional
zero-shot learner. We further introduce a zero-shot setting with VSEPs to
evaluate a model's ability to associate a novel word with a novel visual
category. We find that contextual representations in language models
outperform static word embeddings when the compositional chain of objects is
short. We notice that current visual semantic embedding models lack a mutual
exclusivity bias which limits their performance.
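To make the probing setup concrete, the following is a minimal sketch of a visual semantic embedding probe for zero-shot recognition, written under our own assumptions rather than from the authors' released code: a linear probe maps frozen image features into a word-embedding space, and a test image is assigned the class whose word embedding (static or contextualized) is closest by cosine similarity. The feature dimensions and helper structure are illustrative.
```python
# Hypothetical sketch of a VSEP-style probe; dimensions and names are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticProbe(nn.Module):
    """Linear probe mapping frozen image features into a word-embedding space."""
    def __init__(self, image_dim: int = 2048, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(image_dim, embed_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # Project image features and L2-normalize so cosine similarity
        # reduces to a dot product.
        return F.normalize(self.proj(image_features), dim=-1)

def zero_shot_classify(probe: VisualSemanticProbe,
                       image_features: torch.Tensor,
                       class_embeddings: torch.Tensor) -> torch.Tensor:
    """Assign each image the class whose word embedding is most similar.

    class_embeddings: (num_classes, embed_dim) embeddings of class names,
    e.g. static vectors or contextualized embeddings from a language model.
    Unseen classes only need a word embedding, hence zero-shot.
    """
    visual = probe(image_features)                    # (batch, embed_dim)
    semantic = F.normalize(class_embeddings, dim=-1)  # (classes, embed_dim)
    similarity = visual @ semantic.T                  # cosine similarities
    return similarity.argmax(dim=-1)                  # predicted class index
```
In such a setup the probe is fit on seen classes, and only the class-name embeddings change at test time; swapping static vectors for contextualized ones is exactly the comparison the VSEP is designed to probe.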
Related papers
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in computer vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z) - OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z) - Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbations, such as typos and word-order shuffling, that resonate with human cognitive patterns and allow the perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - I2DFormer: Learning Image to Document Attention for Zero-Shot Image
Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z) - Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z) - VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z) - Explainable Semantic Space by Grounding Language to Vision with
Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z) - What Remains of Visual Semantic Embeddings [0.618778092044887]
We introduce the split of tiered-ImageNet to the ZSL task to avoid the structural flaws in the standard ImageNet benchmark.
We build a unified framework for ZSL with contrastive learning as pre-training, which guarantees no semantic information leakage.
Our work enables fair evaluation of visual semantic embedding models in a ZSL setting in which semantic inference is decisive.
arXiv Detail & Related papers (2021-07-26T06:55:11Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
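As an illustration of the contextualized-object-embedding idea summarized in the COBE entry above, here is a hypothetical sketch in which a detection head regresses each region feature onto the contextualized embedding of the object word within its narration, trained with a cosine loss; the encoder choice, dimensions, and loss are assumptions rather than the paper's exact recipe.
```python
# Hypothetical COBE-style sketch: predict the contextualized embedding of an
# object word from its region feature. Model choice, dimensions, and loss are
# assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

# Maps an (assumed) 1024-d region feature to BERT's 768-d hidden space.
region_head = nn.Linear(1024, 768)

@torch.no_grad()
def contextual_object_embedding(narration: str, object_word: str) -> torch.Tensor:
    """Contextualized embedding of the object word as it appears in the narration."""
    enc = tokenizer(narration, return_tensors="pt")
    hidden = text_encoder(**enc).last_hidden_state[0]   # (seq_len, 768)
    first_subword = tokenizer(object_word, add_special_tokens=False)["input_ids"][0]
    positions = (enc["input_ids"][0] == first_subword).nonzero(as_tuple=True)[0]
    # Simplification: average over occurrences of the object's first subword.
    return hidden[positions].mean(dim=0)

def cobe_style_loss(region_feature: torch.Tensor,
                    narration: str,
                    object_word: str) -> torch.Tensor:
    """Cosine loss pulling the predicted region embedding toward its contextual target."""
    target = contextual_object_embedding(narration, object_word)
    pred = region_head(region_feature)                   # (768,)
    return 1.0 - F.cosine_similarity(pred, target, dim=-1)
```
Because the target depends on the surrounding narration, the same object word can yield different targets in different contexts, which is what distinguishes this from regressing onto a static word vector.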