ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
- URL: http://arxiv.org/abs/2402.12846v1
- Date: Tue, 20 Feb 2024 09:20:30 GMT
- Title: ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
- Authors: Li Mi, Syrielle Montariol, Javiera Castillo-Navarro, Xianjie Dai,
Antoine Bosselut, Devis Tuia
- Abstract summary: We propose Contrastive Visual Question Generation (ConVQG) to generate image-grounded, text-guided, and knowledge-rich questions.
Experiments on knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods.
- Score: 20.009626292937995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Asking questions about visual environments is a crucial way for intelligent
agents to understand rich multi-faceted scenes, raising the importance of
Visual Question Generation (VQG) systems. Apart from being grounded to the
image, existing VQG systems can use textual constraints, such as expected
answers or knowledge triplets, to generate focused questions. These constraints
allow VQG systems to specify the question content or leverage external
commonsense knowledge that cannot be obtained from the image content alone.
However, generating focused questions using textual constraints while enforcing
a high relevance to the image content remains a challenge, as VQG systems often
ignore one or both forms of grounding. In this work, we propose Contrastive
Visual Question Generation (ConVQG), a method using a dual contrastive
objective to discriminate questions generated using both modalities from those
based on a single one. Experiments on both knowledge-aware and standard VQG
benchmarks demonstrate that ConVQG outperforms state-of-the-art methods and
generates image-grounded, text-guided, and knowledge-rich questions. Our human
evaluation results also show a preference for ConVQG questions over those from
non-contrastive baselines.
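To make the dual contrastive objective more concrete, below is a minimal, hypothetical sketch rather than the authors' implementation: it assumes precomputed embeddings for the reference question, the question generated from both modalities, and questions generated from the image or the text alone, and it uses a triplet-style margin loss over cosine similarities. The function names, margin value, and similarity choice are all illustrative assumptions.

```python
# Hypothetical sketch of a dual contrastive objective (not the ConVQG code):
# the question generated from BOTH image and text should match the reference
# question better than questions generated from a single modality.
import torch
import torch.nn.functional as F


def dual_contrastive_loss(q_both, q_img_only, q_txt_only, q_ref, margin=0.2):
    """All arguments are question embeddings of shape (batch, dim)."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    # Image-side contrast: penalize when the image-only question is as close
    # to the reference as the multimodal question (within the margin).
    loss_img = F.relu(margin - sim(q_both, q_ref) + sim(q_img_only, q_ref)).mean()
    # Text-side contrast: the same comparison against the text-only question.
    loss_txt = F.relu(margin - sim(q_both, q_ref) + sim(q_txt_only, q_ref)).mean()
    return loss_img + loss_txt


if __name__ == "__main__":
    batch, dim = 4, 256
    rand = lambda: torch.randn(batch, dim)
    print(dual_contrastive_loss(rand(), rand(), rand(), rand()))
```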
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation [64.64849950642619]
We develop an evaluation framework inspired by formal semantics for evaluating text-to-image models.
We show that Davidsonian Scene Graph (DSG) produces atomic and unique questions organized in dependency graphs.
We also present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts.
arXiv Detail & Related papers (2023-10-27T16:20:10Z)
- Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets [5.45761450227064]
We propose a new Few-Shot Visual Question Generation (FS-VQG) task and provide a comprehensive benchmark for it.
We evaluate various existing VQG approaches as well as popular few-shot solutions based on meta-learning and self-supervised strategies for the FS-VQG task.
Several important findings emerge from our experiments that shed light on the limits of current models in few-shot vision and language generation tasks.
arXiv Detail & Related papers (2022-10-13T15:01:15Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
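As a rough, assumption-laden illustration of the super-node idea described above (not the VQA-GNN implementation), the sketch below joins a toy scene graph and a toy concept graph through a single QA-context node; all node names and the choice of which nodes to link are invented for the example.

```python
# Toy illustration of bridging a scene graph and a concept graph with a
# QA-context super node so that a GNN could pass messages between them.
import networkx as nx

scene_graph = nx.Graph([("person", "umbrella"), ("umbrella", "street")])  # image-derived
concept_graph = nx.Graph([("umbrella", "rain"), ("rain", "weather")])     # external knowledge

# Prefix node names so the two graphs stay distinguishable after merging.
joint = nx.union(scene_graph, concept_graph, rename=("sg_", "cg_"))

# The super node stands for the question-answer context and links both graphs.
joint.add_node("qa_context")
for node in ("sg_person", "sg_umbrella", "cg_umbrella", "cg_rain"):
    joint.add_edge("qa_context", node)

print(sorted(joint.edges("qa_context")))
```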
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition [64.55573343404572]
We present a novel knowledge-aware VQG dataset called K-VQG.
This is the first large, human-annotated dataset in which questions regarding images are tied to structured knowledge.
We also develop a new VQG model that can encode and use knowledge as the target for a question.
arXiv Detail & Related papers (2022-03-15T13:38:10Z)
- Can Open Domain Question Answering Systems Answer Visual Knowledge Questions? [7.442099405543527]
We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions.
This allows for the reuse of existing text-based Open Domain Question Answering (QA) Systems for visual question answering.
We propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions.
arXiv Detail & Related papers (2022-02-09T06:47:40Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- C3VQG: Category Consistent Cyclic Visual Question Generation [51.339348810676896]
Visual Question Generation (VQG) is the task of generating natural questions based on an image.
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.
Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision and (ii) it replaces generic questions with category-relevant ones.
arXiv Detail & Related papers (2020-05-15T20:25:03Z)
- Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing [20.117014315684287]
We use a taxonomy of Knowledge Gaps (KGs) to tag questions with one or more types of KGs.
We then examine the skew in the distribution of questions for each KG.
Questions generated to fill under-represented gaps can then be added to existing VQA datasets to increase the diversity of questions and reduce the skew.
arXiv Detail & Related papers (2020-04-08T00:27:43Z)