CommVQA: Situating Visual Question Answering in Communicative Contexts
- URL: http://arxiv.org/abs/2402.15002v1
- Date: Thu, 22 Feb 2024 22:31:39 GMT
- Title: CommVQA: Situating Visual Question Answering in Communicative Contexts
- Authors: Nandita Shankar Naik, Christopher Potts, Elisa Kreiss
- Abstract summary: We introduce CommVQA, a dataset consisting of images, image descriptions, and real-world communicative scenarios where the image might appear.
We show that CommVQA poses a challenge for current models.
- Score: 17.675630617265288
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current visual question answering (VQA) models tend to be trained and
evaluated on image-question pairs in isolation. However, the questions people
ask are dependent on their informational needs and prior knowledge about the
image content. To evaluate how situating images within naturalistic contexts
shapes visual questions, we introduce CommVQA, a VQA dataset consisting of
images, image descriptions, real-world communicative scenarios where the image
might appear (e.g., a travel website), and follow-up questions and answers
conditioned on the scenario. We show that CommVQA poses a challenge for current
models. Providing contextual information to VQA models improves performance
broadly, highlighting the relevance of situating systems within a communicative
scenario.
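As an illustration of what "providing contextual information" can look like in practice, here is a minimal Python sketch that prepends a CommVQA-style scenario and image description to a VQA prompt. The function name and prompt format are hypothetical stand-ins, not the paper's actual evaluation code.
```python
# A minimal, hypothetical sketch of context-conditioned VQA prompting.
# Nothing here comes from the CommVQA release; the prompt format is an
# illustrative stand-in for the general idea of giving a model the
# communicative scenario alongside the question.

def build_contextual_prompt(scenario: str, description: str, question: str) -> str:
    """Assemble a prompt that situates the question in its scenario."""
    return (
        f"The image appears in the following context: {scenario}.\n"
        f"Image description: {description}\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_contextual_prompt(
    scenario="a travel website",
    description="A cobblestone plaza lined with outdoor cafes at dusk.",
    question="Is this a good spot to visit in the evening?",
)
print(prompt)  # pass this, together with the image, to any VQA model
```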
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and linguistic content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset focus more on the text present in the image than on the visual content.
As a result, models trained on this dataset predict biased answers due to a lack of understanding of the visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering [17.675630617265288]
Visual question answering (VQA) has the potential to make the Internet more accessible in an interactive way.
People who are blind or have low vision prefer image explanations that incorporate the context in which an image appears.
We argue that VQA models will not fully succeed at meeting people's needs unless they take context into account.
arXiv Detail & Related papers (2023-07-28T18:01:08Z)
- Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? [50.29862466940209]
We introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions.
We analyze various pre-trained visual question answering models and gain insights into their characteristics.
We show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents.
arXiv Detail & Related papers (2023-02-23T00:33:54Z)
- Can Open Domain Question Answering Systems Answer Visual Knowledge Questions? [7.442099405543527]
We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions.
This allows for the reuse of existing text-based Open Domain Question Answering (QA) Systems for visual question answering.
We propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions.
arXiv Detail & Related papers (2022-02-09T06:47:40Z)
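To make the three-stage recipe in the entry above concrete, here is a hypothetical Python sketch of the (a) image analysis, (b) question rewriting, and (c) text-based QA pipeline. Every function is an illustrative stand-in, not the authors' system.
```python
# Hypothetical sketch of the three-stage pipeline described above:
# (a) image analysis, (b) question rewriting, (c) text-based QA.
# All component functions are illustrative stand-ins.

def analyze_image(image_path: str) -> list[str]:
    """(a) Stand-in for an image-analysis module returning detected entities."""
    return ["Eiffel Tower", "tourists"]  # canned output for illustration

def rewrite_question(question: str, entities: list[str]) -> str:
    """(b) Replace deictic phrases with a detected entity, de-grounding the question."""
    for phrase in ("this landmark", "this building", "this place"):
        if phrase in question and entities:
            return question.replace(phrase, entities[0])
    return question

def text_qa(question: str) -> str:
    """(c) Stand-in for an existing text-based open-domain QA system."""
    return f"[open-domain QA answer to: {question!r}]"

def answer_visual_question(image_path: str, question: str) -> str:
    entities = analyze_image(image_path)
    ungrounded = rewrite_question(question, entities)
    return text_qa(ungrounded)

print(answer_visual_question("paris.jpg", "How tall is this landmark?"))
```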
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)
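For readers wondering what a multi-image instance looks like, here is a minimal sketch of how an ISVQA-style example could be represented in Python; the class and field names are hypothetical, not the dataset's actual schema.
```python
# Hypothetical representation of a multi-image (ISVQA-style) example;
# field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class ImageSetVQAExample:
    images: list[str]   # paths or IDs for the image set
    question: str       # a question about one image, several, or the whole scene
    answer: str         # ground-truth answer

example = ImageSetVQAExample(
    images=["kitchen_1.jpg", "kitchen_2.jpg", "kitchen_3.jpg"],
    question="How many chairs are visible across these views of the room?",
    answer="four",
)
print(example.question)
```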