Building a visual semantics aware object hierarchy
- URL: http://arxiv.org/abs/2202.13021v1
- Date: Sat, 26 Feb 2022 00:10:21 GMT
- Title: Building a visual semantics aware object hierarchy
- Authors: Xiaolei Diao
- Abstract summary: We propose a novel unsupervised method to build a visual-semantics-aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation, where concepts are hierarchically organized.
The evaluation consists of two parts: first, we apply the constructed hierarchy to the object recognition task, and then we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The semantic gap is defined as the difference between the linguistic representations of the same concept, which usually leads to misunderstanding between individuals with different knowledge backgrounds. Since linguistically annotated images are extensively used to train machine learning models, the semantic gap problem (SGP) also introduces inevitable bias into image annotations, which in turn degrades performance on current computer vision tasks. To address this problem, we propose a novel unsupervised method to build a visual-semantics-aware object hierarchy, aiming to obtain a classification model by learning from purely visual information and to remove the bias of linguistic representations caused by the SGP. Our intuition comes from real-world knowledge representation, where concepts are hierarchically organized and each concept can be described by a set of features rather than a linguistic annotation, namely, its visual semantics. The evaluation consists of two parts: first, we apply the constructed hierarchy to the object recognition task, and then we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method. The preliminary results reveal the efficiency and potential of the proposed method.
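The paper itself does not include code; as one way to make the idea above concrete, the following is a minimal sketch that induces an object hierarchy from purely visual information by agglomerative (Ward) clustering of class-level feature prototypes. The pretrained ResNet-18 backbone, the `class_images` mapping, and the choice of Ward linkage are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation): build an object hierarchy
# from purely visual information by hierarchically clustering per-class
# feature prototypes extracted with a pretrained CNN.
import numpy as np
import torch
from PIL import Image
from scipy.cluster.hierarchy import linkage
from torchvision import models, transforms

# Standard ImageNet preprocessing for the assumed ResNet-18 backbone.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Feature extractor: ResNet-18 with its classification head removed.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def class_prototype(image_paths):
    """Mean visual feature of one object class (a stand-in for its 'visual semantics')."""
    feats = [backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
             for p in image_paths]
    return torch.cat(feats).mean(dim=0).numpy()

def build_visual_hierarchy(class_images):
    """class_images: {class_name: [image paths]} -- hypothetical user-provided data."""
    names = list(class_images)
    prototypes = np.stack([class_prototype(class_images[n]) for n in names])
    # Ward-linkage agglomerative clustering yields a binary tree over the classes.
    tree = linkage(prototypes, method="ward")
    return names, tree  # visualize with scipy.cluster.hierarchy.dendrogram(tree)
```

Any other unsupervised grouping over visual features would fill the same role; Ward linkage is used here only because it directly yields a binary tree over the class prototypes.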
Related papers
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- A semantics-driven methodology for high-quality image annotation [4.7590051176368915]
We propose vTelos, an integrated Natural Language Processing, Knowledge Representation, and Computer Vision methodology.
A key element of vTelos is the exploitation of the WordNet lexico-semantic hierarchy as the main means of providing the meaning of natural language labels.
The methodology is validated on images populating a subset of the ImageNet hierarchy.
arXiv Detail & Related papers (2023-07-26T11:38:45Z)
- Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions [2.7195102129095003]
We introduce a framework for testing category-level perceptual grounding in multi-modal language models.
We train separate neural networks to generate and interpret descriptions of visual categories.
We show that communicative success exposes performance issues in the generation model.
arXiv Detail & Related papers (2023-03-07T17:01:25Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Cross-Modal Alignment Learning of Vision-Language Conceptual Systems [24.423011687551433]
We propose methods for learning aligned vision-language conceptual systems inspired by infants' word learning mechanisms.
The proposed model learns the associations of visual objects and words online and gradually constructs cross-modal relational graph networks.
arXiv Detail & Related papers (2022-07-31T08:39:53Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
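Several of the papers above (e.g., vTelos and K-LITE) lean on the WordNet lexico-semantic hierarchy, which is also the natural reference point for the second part of the evaluation described in the abstract (comparing the visual hierarchy with existing lexical hierarchies). The paper does not state its comparison metric; the sketch below is one hedged possibility, rank-correlating cophenetic distances in the visual tree from the previous sketch against WordNet path-based distances. The `names` and `tree` arguments are assumed to be the outputs of the hypothetical `build_visual_hierarchy` above, and class names are assumed to be WordNet nouns.

```python
# Minimal sketch (not the authors' evaluation): measure agreement between the
# induced visual hierarchy and the WordNet lexical hierarchy via Spearman
# correlation of pairwise class distances. Requires nltk.download("wordnet").
from itertools import combinations

from nltk.corpus import wordnet as wn
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def wordnet_distance(name_a, name_b):
    """Lexical distance: 1 - path similarity of the first noun synsets."""
    syn_a = wn.synsets(name_a, pos=wn.NOUN)[0]   # assumes the name exists in WordNet
    syn_b = wn.synsets(name_b, pos=wn.NOUN)[0]
    return 1.0 - syn_a.path_similarity(syn_b)

def hierarchy_agreement(names, tree):
    """Spearman rho between visual (cophenetic) and lexical pairwise distances."""
    visual = squareform(cophenet(tree))          # class-to-class distances in the visual tree
    pairs = list(combinations(range(len(names)), 2))
    visual_d = [visual[i, j] for i, j in pairs]
    lexical_d = [wordnet_distance(names[i], names[j]) for i, j in pairs]
    rho, _ = spearmanr(visual_d, lexical_d)
    return rho  # higher rho: visual hierarchy agrees more with the lexical one
```

A single correlation coefficient is of course a coarse summary; tree-edit or taxonomy-overlap measures would be equally reasonable substitutes.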
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.