Towards Image Semantics and Syntax Sequence Learning
- URL: http://arxiv.org/abs/2401.17515v1
- Date: Wed, 31 Jan 2024 00:16:02 GMT
- Title: Towards Image Semantics and Syntax Sequence Learning
- Authors: Chun Tao, Timur Ibrayev, Kaushik Roy
- Abstract summary: We introduce the concept of "image grammar", consisting of "image semantics" and "image syntax".
We propose a weakly supervised two-stage approach to learn the image grammar relative to a class of visual objects/scenes.
Our framework is trained to reason over patch semantics and detect faulty syntax.
- Score: 8.033697392628424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional neural networks and vision transformers have achieved
outstanding performance in machine perception, particularly for image
classification. Although these image classifiers excel at predicting
image-level class labels, they may not discriminate missing or shifted parts
within an object. As a result, they may fail to detect corrupted images that
involve missing or disarrayed semantic information in the object composition.
In contrast, human perception easily distinguishes such corruptions. To
mitigate this gap, we introduce the concept of "image grammar", consisting of
"image semantics" and "image syntax", to denote the semantics of parts or
patches of an image and the order in which these parts are arranged to create a
meaningful object. To learn the image grammar relative to a class of visual
objects/scenes, we propose a weakly supervised two-stage approach. In the first
stage, we use a deep clustering framework that relies on iterative clustering
and feature refinement to produce part-semantic segmentation. In the second
stage, we incorporate a recurrent bi-LSTM module to process a sequence of
semantic segmentation patches to capture the image syntax. Our framework is
trained to reason over patch semantics and detect faulty syntax. We benchmark
the performance of several grammar learning models in detecting patch
corruptions. Finally, we verify the capabilities of our framework on the
Celeb and SUNRGBD datasets and demonstrate that it can achieve a grammar validation
accuracy of 70 to 90% in a wide variety of semantic and syntactical corruption
scenarios.
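The two-stage pipeline described above can be sketched in miniature. This is a hypothetical illustration, not the authors' implementation: plain k-means stands in for the paper's deep clustering with iterative feature refinement, and a bigram transition table learned from valid images stands in for the bi-LSTM syntax module; all names and toy data are assumptions.

```python
import numpy as np

# Stage 1 (sketch): assign patch features to part-semantic labels.
# The paper uses deep clustering with iterative feature refinement;
# plain k-means on toy features stands in here.
def kmeans_labels(features, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Stage 2 (sketch): validate the order of patch labels ("syntax").
# The paper trains a bi-LSTM over the label sequence; a bigram
# transition table learned from valid images is a minimal stand-in.
def fit_transitions(label_seqs):
    seen = set()
    for seq in label_seqs:
        seen.update(zip(seq[:-1], seq[1:]))
    return seen

def is_valid(seq, seen):
    return all(pair in seen for pair in zip(seq[:-1], seq[1:]))

# Toy demo: valid objects scan their patches in the order 0, 1, 2, 3.
seen = fit_transitions([[0, 1, 2, 3], [0, 1, 2, 3]])
print(is_valid([0, 1, 2, 3], seen))  # True  - syntax matches training
print(is_valid([0, 2, 1, 3], seen))  # False - shuffled patches break syntax
```

The point of the stand-in is the same as in the paper: a corruption that leaves every patch's class label intact but shuffles their order is invisible to an image-level classifier, yet it produces label transitions never seen in valid images.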
Related papers
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments show an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
- MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
- A semantics-driven methodology for high-quality image annotation [4.7590051176368915]
We propose vTelos, an integrated Natural Language Processing, Knowledge Representation, and Computer Vision methodology.
Key element of vTelos is the exploitation of the WordNet lexico-semantic hierarchy as the main means for providing the meaning of natural language labels.
The methodology is validated on images populating a subset of the ImageNet hierarchy.
arXiv Detail & Related papers (2023-07-26T11:38:45Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
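The relative-location pretext task summarized above can be illustrated by its labelling step alone. This is a hypothetical sketch under assumed conventions (a 3x3 patch grid, centre patch as reference); patch extraction, masking, and the transformer itself are omitted.

```python
# Hypothetical sketch of the relative-location labelling: for a 3x3 grid
# of image patches, the training target for a (reference, query) pair is
# the query's position relative to the centre reference patch, one of 8
# direction classes.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]

def relative_location_label(ref, query):
    """Class index of `query`'s grid offset from `ref` (adjacent cells only)."""
    offset = (query[0] - ref[0], query[1] - ref[1])
    return DIRECTIONS.index(offset)

# All 8 neighbours of the centre patch (1, 1) get distinct labels.
ref = (1, 1)
labels = [relative_location_label(ref, (r, c))
          for r in range(3) for c in range(3) if (r, c) != ref]
print(labels)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

A network pretrained to predict these labels from patch appearance must learn where object parts sit relative to one another, which is why such representations transfer well to segmentation.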
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition [33.2800417526215]
Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade.
Visual learning through image-text alignment, however, has emerged to show promising performance, especially for zero-shot recognition.
We propose a deep fusion method with three adaptations that effectively bridge two learning tasks.
arXiv Detail & Related papers (2022-04-22T15:27:21Z)
- Evaluating language-biased image classification based on semantic representations [13.508894957080777]
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference.
Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification.
arXiv Detail & Related papers (2022-01-26T15:46:36Z)
- Maximize the Exploration of Congeneric Semantics for Weakly Supervised Semantic Segmentation [27.155133686127474]
We construct a graph neural network (P-GNN) based on the self-detected patches from different images that contain the same class labels.
We conduct experiments on the popular PASCAL VOC 2012 benchmarks, and our model yields state-of-the-art performance.
arXiv Detail & Related papers (2021-10-08T08:59:16Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.