Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and
Word2Vec to generate Object and Scene Embeddings from Images
- URL: http://arxiv.org/abs/2009.09384v1
- Date: Sun, 20 Sep 2020 08:26:38 GMT
- Title: Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and
Word2Vec to generate Object and Scene Embeddings from Images
- Authors: Matthias S. Treder, Juan Mayor-Torres, Christoph Teufel
- Abstract summary: We develop two approaches for learning object and scene embeddings from annotated images.
In the first approach, we generate embeddings from object co-occurrences in whole images, one for objects and one for scenes.
In the second approach, rather than analyzing whole images of scenes, we focus on co-occurrences of objects within subregions of an image.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embeddings are an important tool for the representation of word meaning.
Their effectiveness rests on the distributional hypothesis: words that occur in
the same context carry similar semantic information. Here, we adapt this
approach to index visual semantics in images of scenes. To this end, we
formulate a distributional hypothesis for objects and scenes: Scenes that
contain the same objects (object context) are semantically related. Similarly,
objects that appear in the same spatial context (within a scene or subregions
of a scene) are semantically related. We develop two approaches for learning
object and scene embeddings from annotated images. In the first approach, we
adapt LSA and Word2vec's Skipgram and CBOW models to generate two sets of
embeddings from object co-occurrences in whole images, one for objects and one
for scenes. The representational space spanned by these embeddings suggests
that the distributional hypothesis holds for images. In an initial application
of this approach, we show that our image-based embeddings improve scene
classification models such as ResNet18 and VGG-11 (3.72\% improvement on Top5
accuracy, 4.56\% improvement on Top1 accuracy). In the second approach, rather
than analyzing whole images of scenes, we focus on co-occurrences of objects
within subregions of an image. We illustrate that this method yields a sensible
hierarchical decomposition of a scene into collections of semantically related
objects. Overall, these results suggest that object and scene embeddings from
object co-occurrences and spatial context yield semantically meaningful
representations as well as computational improvements for downstream
applications such as scene classification.
Related papers
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z) - LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z) - Hyperbolic Contrastive Learning for Visual Representations beyond
Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z) - Describing Sets of Images with Textual-PCA [89.46499914148993]
We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set.
Our procedure is analogous to Principle Component Analysis, in which the role of projection vectors is replaced with generated phrases.
arXiv Detail & Related papers (2022-10-21T17:10:49Z) - Natural Scene Image Annotation Using Local Semantic Concepts and Spatial
Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on a hypothesis that assumes that, in natural scenes, intermediate semantic concepts are correlated with the local keypoints.
Based on this hypothesis, image regions can be efficiently represented by BOW model and using a machine learning approach, such as SVM, to label image regions with semantic annotations.
arXiv Detail & Related papers (2022-10-17T12:57:51Z) - Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state-of-the-art by 3.72% mAP and 5.39% mAP, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z) - Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose a two-stage method for achieving complex scene image editing by Scene Graph (SGC-Net)
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs and predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions.
arXiv Detail & Related papers (2022-03-24T05:12:54Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine objects detail along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires textitexplicitly modeling the object textitbody and textitedge, which correspond to the high and low frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z) - Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN)
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.