How good are deep models in understanding the generated images?
- URL: http://arxiv.org/abs/2208.10760v2
- Date: Thu, 25 Aug 2022 03:32:33 GMT
- Title: How good are deep models in understanding the generated images?
- Authors: Ali Borji
- Abstract summary: Two sets of generated images are collected for object recognition and visual question answering (VQA) tasks.
On object recognition, the best model, out of 10 state-of-the-art object recognition models, achieves about 60% and 80% top-1 and top-5 accuracy.
On VQA, the OFA model scores 77.3% on answering 241 binary questions across 50 images.
- Score: 47.64219291655723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: My goal in this paper is twofold: to study how well deep models can
understand the images generated by DALL-E 2 and Midjourney, and to
quantitatively evaluate these generative models. Two sets of generated images
are collected for object recognition and visual question answering (VQA) tasks.
On object recognition, the best model, out of 10 state-of-the-art object
recognition models, achieves about 60% and 80% top-1 and top-5 accuracy,
respectively. These numbers are much lower than the best accuracy on the
ImageNet dataset (91% and 99%). On VQA, the OFA model scores 77.3% on
answering 241 binary questions across 50 images. This model scores 94.7% on
the binary VQA-v2 dataset. Humans are able to recognize the generated images
and answer questions on them easily. We conclude that a) deep models struggle
to understand the generated content, and may do better after fine-tuning, and
b) there is a large distribution shift between the generated images and the
real photographs. The distribution shift appears to be category-dependent. Data
is available at:
https://drive.google.com/file/d/1n2nCiaXtYJRRF2R73-LNE3zggeU_HeH0/view?usp=sharing.
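The evaluation described in the abstract amounts to running off-the-shelf recognition models on the generated images and reporting top-1/top-5 accuracy. Below is a minimal sketch, not the paper's released code, of how such a measurement could be set up with a standard pretrained torchvision classifier; the `generated_images/` folder layout, the choice of ResNet-50, and the assumption that folder indices line up with ImageNet class indices are all illustrative.

```python
# Minimal sketch: top-1/top-5 accuracy of a pretrained classifier on a folder
# of generated images. Folder layout, model choice, and label mapping are
# assumptions for illustration, not the paper's actual setup.
import torch
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Assumed layout: generated_images/<class_name>/*.png, where the alphabetical
# folder index is assumed to match the ImageNet class index of that category.
dataset = ImageFolder(
    "generated_images",
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ]),
)
loader = DataLoader(dataset, batch_size=32, shuffle=False)

# One of the many ImageNet-pretrained classifiers one could plug in here.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

top1 = top5 = total = 0
with torch.no_grad():
    for images, labels in loader:
        logits = model(images)                      # (B, 1000) class scores
        top5_preds = logits.topk(5, dim=1).indices  # (B, 5) highest-scoring classes
        top1 += (top5_preds[:, 0] == labels).sum().item()
        top5 += (top5_preds == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += labels.size(0)

print(f"top-1: {top1 / total:.1%}, top-5: {top5 / total:.1%}")
```

The paper reports results for 10 recognition models; repeating the loop with other `torchvision.models` entries would extend this sketch to that setting.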
Related papers
- How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold [50.33428591760124]
We study the relationship between a concept's frequency in the training dataset and the ability of a model to imitate it.
We propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training multiple models from scratch.
arXiv Detail & Related papers (2024-10-19T06:28:14Z) - Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering [13.490305443938817]
We introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel evaluation metric.
I-HallA measures the factuality of generated images through visual question answering (VQA)
We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information.
arXiv Detail & Related papers (2024-09-19T13:51:21Z) - Diverse, Difficult, and Odd Instances (D2O): A New Test Set for Object
Classification [47.64219291655723]
We introduce a new test set, called D2O, which is sufficiently different from existing test sets.
Our dataset contains 8,060 images spread across 36 categories, out of which 29 appear in ImageNet.
The best Top-1 accuracy on our dataset is around 60% which is much lower than 91% best Top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2023-01-29T19:58:32Z) - BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution
Generalization of VQA Models [47.64219291655723]
We introduce a new test set for visual question answering (VQA) called BinaryVQA to push the limits of VQA models.
Our dataset includes 7,800 questions across 1,024 images and covers a wide variety of objects, topics, and concepts.
Around 63% of the questions have positive answers.
arXiv Detail & Related papers (2023-01-28T00:03:44Z) - BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z) - InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images to the latent space of a high quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z) - Contemplating real-world object classification [53.10151901863263]
We reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations.
We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement.
arXiv Detail & Related papers (2021-03-08T23:29:59Z) - Rethinking Recurrent Neural Networks and Other Improvements for Image
Classification [1.5990720051907859]
We propose integrating an RNN as an additional layer when designing image recognition models.
We also develop end-to-end multimodel ensembles that produce expert predictions using several models.
Our model sets a new record on the Surrey dataset.
arXiv Detail & Related papers (2020-07-30T00:40:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.