Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2010.14953v1
- Date: Wed, 28 Oct 2020 13:11:34 GMT
- Title: Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
- Authors: Stanislav Frolov, Shailza Jolly, Jörn Hees, Andreas Dengel
- Abstract summary: We propose an effective way to combine Text-to-Image (T2I) synthesis with Visual Question Answering (VQA) to improve the image quality and image-text alignment.
We create additional training samples by concatenating question and answer (QA) pairs and employ a standard VQA model to provide the T2I model with an auxiliary learning signal.
Our method lowers the FID from 27.84 to 25.38 and increases the R-prec. from 83.82% to 84.79% when compared to the baseline.
- Score: 5.4897944234841445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating images from textual descriptions has recently attracted a lot of
interest. While current models can generate photo-realistic images of
individual objects such as birds and human faces, synthesising images with
multiple objects is still very difficult. In this paper, we propose an
effective way to combine Text-to-Image (T2I) synthesis with Visual Question
Answering (VQA) to improve the image quality and image-text alignment of
generated images by leveraging the VQA 2.0 dataset. We create additional
training samples by concatenating question and answer (QA) pairs and employ a
standard VQA model to provide the T2I model with an auxiliary learning signal.
We encourage images generated from QA pairs to look realistic and additionally
minimize an external VQA loss. Our method lowers the FID from 27.84 to 25.38
and increases the R-prec. from 83.82% to 84.79% when compared to the baseline,
which indicates that T2I synthesis can successfully be improved using a
standard VQA model.
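As a rough illustration of the training signal described in the abstract, the following PyTorch-style sketch shows one way such an auxiliary VQA loss could be wired up. It is not the authors' implementation: `t2i_generator`, `discriminator`, `vqa_model`, `encode_text`, the noise dimension, and the weight `lambda_vqa` are hypothetical placeholders, and the adversarial term is a generic GAN loss rather than the paper's exact objective.

```python
# Minimal sketch of the auxiliary VQA signal described in the abstract.
# All modules passed in below are hypothetical placeholders.
import torch
import torch.nn.functional as F

def qa_caption(question: str, answer: str) -> str:
    """Form an additional training caption by concatenating a QA pair."""
    return f"{question} {answer}"

def t2i_vqa_step(t2i_generator, discriminator, vqa_model,
                 encode_text, question, answer, answer_label,
                 lambda_vqa=1.0):
    """One generator update on a QA-derived sample.

    The generated image should (a) look realistic to the discriminator and
    (b) let a standard, frozen VQA model recover the answer, which acts as
    the external VQA loss mentioned in the abstract.
    """
    # 1) Treat the concatenated QA pair as a caption and generate an image.
    text_emb = encode_text(qa_caption(question, answer))
    fake_image = t2i_generator(text_emb, torch.randn(1, 100))  # assumed noise dim

    # 2) Realism term: generic conditional GAN loss (assumed (1, 1) logit output).
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake_image, text_emb), torch.ones(1, 1))

    # 3) External VQA loss: the frozen VQA model answers `question` about the
    #    generated image; its prediction should match the ground-truth answer.
    with torch.no_grad():
        question_emb = encode_text(question)
    answer_logits = vqa_model(fake_image, question_emb)
    vqa_loss = F.cross_entropy(answer_logits, answer_label)

    return adv_loss + lambda_vqa * vqa_loss
```

In this sketch only the generator would be updated; the VQA model stays frozen and simply checks whether the answer can still be read off the generated image.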
Related papers
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA).
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- All You May Need for VQA are Image Captions [24.634567673906666]
We propose a method that automatically derives VQA examples at volume.
We show that the resulting data is of high quality.
VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits.
arXiv Detail & Related papers (2022-05-04T04:09:23Z)
- A Picture May Be Worth a Hundred Words for Visual Question Answering [26.83504716672634]
In image understanding, it is essential to use concise but detailed image representations.
Deep visual features extracted by vision models, such as Faster R-CNN, are widely used across multiple tasks.
We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model.
arXiv Detail & Related papers (2021-06-25T06:13:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.