Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
- URL: http://arxiv.org/abs/2307.15745v2
- Date: Wed, 30 Aug 2023 15:58:56 GMT
- Title: Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
- Authors: Nandita Naik, Christopher Potts, Elisa Kreiss
- Abstract summary: Visual question answering (VQA) has the potential to make the Internet more accessible in an interactive way.
People who are blind or have low vision prefer image explanations that incorporate the context in which an image appears.
We argue that VQA models will not fully succeed at meeting people's needs unless they take context into account.
- Score: 17.675630617265288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) has the potential to make the Internet more
accessible in an interactive way, allowing people who cannot see images to ask
questions about them. However, multiple studies have shown that people who are
blind or have low vision prefer image explanations that incorporate the context
in which an image appears, yet current VQA datasets focus on images in
isolation. We argue that VQA models will not fully succeed at meeting people's
needs unless they take context into account. To further motivate and analyze
the distinction between different contexts, we introduce Context-VQA, a VQA
dataset that pairs images with contexts, specifically types of websites (e.g.,
a shopping website). We find that the types of questions vary systematically
across contexts. For example, images presented in a travel context garner 2
times more "Where?" questions, and images on social media and news garner 2.8
and 1.8 times more "Who?" questions than the average, respectively. We also find that context
effects are especially important when participants can't see the image. These
results demonstrate that context affects the types of questions asked and that
VQA models should be context-sensitive to better meet people's needs,
especially in accessibility settings.
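
The reported multipliers (for example, 2 times more "Where?" questions in a travel context) compare a question type's rate within one context to its average rate across all contexts. Below is a minimal sketch of that computation over a hypothetical Context-VQA-style record format; the field names, example records, and wh-word heuristic are illustrative assumptions, not the dataset's released schema.

```python
from collections import Counter, defaultdict

# Hypothetical Context-VQA-style records: each question is paired with the
# website context in which its image appears. Field names and examples are
# illustrative, not the dataset's actual schema.
examples = [
    {"context": "travel",   "question": "Where was this photo taken?"},
    {"context": "travel",   "question": "What mountain is in the background?"},
    {"context": "social",   "question": "Who is in the picture?"},
    {"context": "news",     "question": "Who is the person speaking?"},
    {"context": "shopping", "question": "What color is the jacket?"},
]

QUESTION_TYPES = ("who", "what", "where", "when", "why", "how")

def question_type(question: str) -> str:
    """Crude wh-word heuristic: classify a question by its first word."""
    first = question.lower().split()[0]
    return first if first in QUESTION_TYPES else "other"

# Tally question types per context and overall.
per_context = defaultdict(Counter)
overall = Counter()
for ex in examples:
    qtype = question_type(ex["question"])
    per_context[ex["context"]][qtype] += 1
    overall[qtype] += 1

def rate_ratio(context: str, qtype: str) -> float:
    """Rate of `qtype` within `context`, relative to its average rate overall."""
    ctx_counts = per_context[context]
    ctx_rate = ctx_counts[qtype] / sum(ctx_counts.values())
    avg_rate = overall[qtype] / sum(overall.values())
    return ctx_rate / avg_rate

print(round(rate_ratio("travel", "where"), 2))  # e.g. 2.5 on these toy records
print(round(rate_ratio("social", "who"), 2))
```

On the real dataset, ratios like these are what underlie statements such as "travel images garner 2x more 'Where?' questions."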
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- CommVQA: Situating Visual Question Answering in Communicative Contexts [16.180130883242672]
We introduce CommVQA, a dataset consisting of images, image descriptions, and real-world communicative scenarios where the image might appear.
We show that access to contextual information is essential for solving CommVQA, leading to the highest-performing VQA model.
arXiv Detail & Related papers (2024-02-22T22:31:39Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims to answer questions by reading the text present in images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
Models trained on this dataset predict biased answers due to a lack of understanding of the visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA); a sketch of this recipe appears after this list.
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
- ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z)
- A Picture May Be Worth a Hundred Words for Visual Question Answering [26.83504716672634]
In image understanding, it is essential to use concise but detailed image representations.
Deep visual features extracted by vision models, such as Faster R-CNN, are widely used in multiple tasks.
We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model.
arXiv Detail & Related papers (2021-06-25T06:13:14Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)
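
As referenced in the TIFA entry above, the general recipe it describes is: derive question-answer pairs from the text input, pose each question to a VQA model over the generated image, and report the fraction answered as expected. The following is a minimal, model-agnostic sketch under those assumptions; the `ProbeQuestion` record and the `vqa_model` callable are placeholders, not TIFA's released components.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProbeQuestion:
    """A question derived from the text prompt, with its expected answer."""
    question: str
    expected_answer: str
    category: str  # e.g. "object", "counting", "color"

def faithfulness_score(
    image,
    probes: list[ProbeQuestion],
    vqa_model: Callable[[object, str], str],
) -> float:
    """Fraction of prompt-derived questions the VQA model answers as expected.

    `vqa_model(image, question)` stands in for any VQA system that returns a
    short textual answer; TIFA's actual components may differ.
    """
    if not probes:
        raise ValueError("need at least one probe question")
    correct = sum(
        vqa_model(image, p.question).strip().lower() == p.expected_answer.lower()
        for p in probes
    )
    return correct / len(probes)

# Toy usage with a stub VQA model that always answers "two".
if __name__ == "__main__":
    probes = [
        ProbeQuestion("How many dogs are in the image?", "two", "counting"),
        ProbeQuestion("What animal is shown?", "dog", "object"),
    ]
    stub_vqa = lambda image, question: "two"
    print(faithfulness_score(image=None, probes=probes, vqa_model=stub_vqa))  # 0.5
```

The stub model only keeps the example self-contained; any image question-answering system with the same call signature could be dropped in.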