Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words
- URL: http://arxiv.org/abs/2210.09045v1
- Date: Mon, 17 Oct 2022 12:57:51 GMT
- Title: Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words
- Authors: Yousef Alqasrawi
- Abstract summary: This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on the hypothesis that, in natural scenes, intermediate semantic concepts are correlated with local keypoints.
Based on this hypothesis, image regions can be efficiently represented with the BOW model, and a machine learning approach, such as an SVM, can be used to label image regions with semantic annotations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of the bag of visual words (BOW) model for modelling images based on local invariant features computed at interest point locations has become a standard choice for many computer vision tasks. Visual vocabularies generated from image feature vectors are expected to produce visual words that are discriminative enough to improve the performance of image annotation systems. However, most techniques that adopt the BOW model for annotating images disregard useful information that can be mined from image categories to build discriminative visual vocabularies. To this end, this paper introduces a detailed framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary. The framework is based on the hypothesis that, in natural scenes, intermediate semantic concepts are correlated with local keypoints. Based on this hypothesis, image regions can be efficiently represented with the BOW model, and a machine learning approach, such as an SVM, can be used to label image regions with semantic annotations. Another objective of this paper is to examine how generating visual vocabularies from image halves, rather than from whole images, affects the performance of annotating image regions with semantic labels. All BOW-based approaches, as well as baseline methods, have been extensively evaluated on a six-category dataset of natural scenes using SVM and KNN classifiers. The reported results show that the BOW model is a plausible way to represent the semantic information of image regions and thus to annotate them automatically with semantic labels.
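To make the annotation pipeline concrete, the following is a minimal sketch in Python of the BOW region-labelling idea described in the abstract: local descriptors are extracted per image region, quantized against a k-means visual vocabulary into a visual-word histogram, and the histogram is classified into a local semantic concept with an SVM. The library choices (OpenCV, scikit-learn), the 2x2 region grid, the vocabulary size, and all function and variable names are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a BOW-based region annotation pipeline (illustrative only;
# grid size, vocabulary size, and classifier settings are assumptions, not the
# paper's exact setup).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def region_descriptors(image, n_rows=2, n_cols=2):
    """Split an image into a grid of regions and return SIFT descriptors per region."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    h, w = gray.shape
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            patch = gray[r * h // n_rows:(r + 1) * h // n_rows,
                         c * w // n_cols:(c + 1) * w // n_cols]
            _, desc = sift.detectAndCompute(patch, None)
            regions.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return regions

def build_vocabulary(descriptor_arrays, k=400):
    """Cluster pooled descriptors into k visual words (the visual vocabulary)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(descriptor_arrays))

def bow_histogram(descriptors, vocabulary):
    """Represent one region as a normalized histogram of visual-word occurrences."""
    k = vocabulary.n_clusters
    if len(descriptors) == 0:
        return np.zeros(k)
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

# Training (hypothetical data): histograms of labelled regions -> SVM over local
# semantic concepts such as "sky", "water", or "grass".
# svm = SVC(kernel="rbf").fit(train_histograms, train_concept_labels)
# predicted_concept = svm.predict([bow_histogram(region_desc, vocabulary)])
```

The paper's second question, building vocabularies from image halves instead of whole images, would correspond in this sketch to fitting a separate KMeans vocabulary on descriptors pooled from the top and bottom halves of the training images and quantizing each region against the vocabulary of its own half.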
Related papers
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
- Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and Word2Vec to generate Object and Scene Embeddings from Images [0.0]
We develop two approaches for learning object and scene embeddings from annotated images.
In the first approach, we generate embeddings from object co-occurrences in whole images, one for objects and one for scenes.
In the second approach, rather than analyzing whole images of scenes, we focus on co-occurrences of objects within subregions of an image.
arXiv Detail & Related papers (2020-09-20T08:26:38Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.