Towards Visual Syntactical Understanding
- URL: http://arxiv.org/abs/2401.17497v1
- Date: Tue, 30 Jan 2024 23:05:43 GMT
- Title: Towards Visual Syntactical Understanding
- Authors: Sayeed Shafayet Chowdhury, Soumyadeep Chandra, and Kaushik Roy
- Abstract summary: We investigate whether deep neural networks (DNNs) are equipped with visual syntactic understanding.
We propose a three-stage framework: (i) the 'words' in the image are detected, (ii) the detected words are sequentially masked and reconstructed using an autoencoder, (iii) the original and reconstructed parts are compared at each location to determine syntactic correctness.
We obtain classification accuracies of 92.10% and 90.89% on the CelebA and AFHQ datasets, respectively.
- Score: 8.530698703124159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Syntax is usually studied in the realm of linguistics and refers to the
arrangement of words in a sentence. Similarly, an image can be considered as a
visual 'sentence', with the semantic parts of the image acting as 'words'.
While visual syntactic understanding occurs naturally to humans, it is
interesting to explore whether deep neural networks (DNNs) are equipped with
such reasoning. To that end, we alter the syntax of natural images (e.g.,
swapping the eye and nose of a face), referred to as 'incorrect' images, to
investigate the sensitivity of DNNs to such syntactic anomalies. Through our
experiments, we discover an intriguing property of DNNs: state-of-the-art
convolutional neural networks, as well as vision transformers, fail to
discriminate between syntactically correct and incorrect images when trained
on only correct ones. To counter this issue and enable visual syntactic
understanding with DNNs, we propose a three-stage framework: (i) the 'words'
(or the sub-features) in the image are detected, (ii) the detected words are
sequentially masked and reconstructed using an autoencoder, (iii) the original
and reconstructed parts are compared at each location to determine syntactic
correctness. The reconstruction module is trained with BERT-like masked
autoencoding for images, with the motivation to leverage language model
inspired training to better capture the syntax. Note that our proposed approach
is unsupervised in the sense that the incorrect images are only used during
testing and the correct-versus-incorrect labels are never used for training. We
perform experiments on the CelebA and AFHQ datasets and obtain classification
accuracies of 92.10% and 90.89%, respectively. Notably, the approach generalizes
well to ImageNet samples which share common classes with CelebA and AFHQ
without explicitly training on them.
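As a rough picture of the three-stage pipeline described in the abstract, the sketch below checks an image part by part. It is a minimal sketch that assumes the part detector, the masked-autoencoder reconstructor, and the error threshold are supplied by the caller; the names `detect_parts`, `reconstruct_masked`, and `tau` are illustrative placeholders rather than the authors' implementation.

```python
import numpy as np

def is_syntactically_correct(image, detect_parts, reconstruct_masked, tau=0.1):
    """Sketch of the three-stage check from the abstract (names are placeholders).

    detect_parts(image)               -> list of (region, crop) pairs for the semantic 'words'
    reconstruct_masked(image, region) -> reconstruction of the masked-out region
    tau                               -> assumed reconstruction-error threshold
    """
    # Stage (i): detect the 'words' (semantic parts) in the image.
    parts = detect_parts(image)

    errors = []
    for region, crop in parts:
        # Stage (ii): mask this part and reconstruct it with the
        # BERT-style masked autoencoder.
        recon = reconstruct_masked(image, region)
        # Stage (iii): compare the original and reconstructed parts.
        errors.append(float(np.mean((np.asarray(crop) - np.asarray(recon)) ** 2)))

    # A syntactically incorrect image should yield at least one part that the
    # reconstructor cannot explain, i.e. a large reconstruction error.
    return max(errors) < tau if errors else True
```

The per-part MSE and fixed threshold used here are simplifications for illustration; the paper's comparison of original and reconstructed parts at each location could equally be instantiated with a perceptual distance.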
Related papers
- Towards Image Semantics and Syntax Sequence Learning [8.033697392628424]
We introduce the concept of "image grammar", consisting of "image semantics" and "image syntax".
We propose a weakly supervised two-stage approach to learn the image grammar relative to a class of visual objects/scenes.
Our framework is trained to reason over patch semantics and detect faulty syntax.
arXiv Detail & Related papers (2024-01-31T00:16:02Z)
- Improving Generalization of Image Captioning with Unsupervised Prompt Learning [63.26197177542422]
Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data.
GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
arXiv Detail & Related papers (2023-08-05T12:27:01Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and a proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency (a toy sketch of this aggregation step follows after this list).
arXiv Detail & Related papers (2023-05-11T17:58:17Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL on its own is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision-and-language models, which typically adopt a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
- SynthMorph: learning contrast-invariant registration without acquired images [8.0963891430422]
We introduce a strategy for learning image registration without acquired imaging data.
We show that this strategy enables robust and accurate registration of arbitrary MRI contrasts.
arXiv Detail & Related papers (2020-04-21T20:29:39Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
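For the Token-Level Confidence (TLC) entry above, the aggregation step can be pictured with the small sketch below, assuming the fine-tuned captioning model exposes per-token log-probabilities for the proposed caption; the function name and the mean/min aggregators are illustrative assumptions, not the paper's exact scoring rule.

```python
import math

def caption_consistency(token_logprobs, mode="mean"):
    """Aggregate per-token confidences into one image-caption consistency score.

    token_logprobs: log p(token | image, previous tokens) for each caption token,
                    assumed to come from a captioning model fine-tuned as in TLC.
    mode: 'mean' averages token confidences; 'min' scores the weakest token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    probs = [math.exp(lp) for lp in token_logprobs]
    return min(probs) if mode == "min" else sum(probs) / len(probs)
```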