Point and Ask: Incorporating Pointing into Visual Question Answering
- URL: http://arxiv.org/abs/2011.13681v4
- Date: Fri, 18 Feb 2022 05:50:50 GMT
- Title: Point and Ask: Incorporating Pointing into Visual Question Answering
- Authors: Arjun Mani, Nobline Yoo, Will Hinthorn, Olga Russakovsky
- Abstract summary: We introduce and motivate point-input questions as an extension of Visual Question Answering (VQA).
Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region.
We uncover and address several visual recognition challenges, including the ability to infer human intent.
- Score: 14.744503080484977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) has become one of the key benchmarks of
visual recognition progress. Multiple VQA extensions have been explored to
better simulate real-world settings: different question formulations, changing
training and test distributions, conversational consistency in dialogues, and
explanation-based answering. In this work, we further expand this space by
considering visual questions that include a spatial point of reference.
Pointing is a nearly universal gesture among humans, and real-world VQA is
likely to involve a gesture towards the target region.
Concretely, we (1) introduce and motivate point-input questions as an
extension of VQA, (2) define three novel classes of questions within this
space, and (3) for each class, introduce both a benchmark dataset and a series
of baseline models to handle its unique challenges. There are two key
distinctions from prior work. First, we explicitly design the benchmarks to
require the point input, i.e., we ensure that the visual question cannot be
answered accurately without the spatial reference. Second, we explicitly
explore the more realistic point spatial input rather than the standard but
unnatural bounding box input. Through our exploration we uncover and address
several visual recognition challenges, including the ability to infer human
intent, reason both locally and globally about the image, and effectively
combine visual, language and spatial inputs. Code is available at:
https://github.com/princetonvisualai/pointingqa.
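
As a rough illustration of what it can mean to "effectively combine visual, language and spatial inputs", the sketch below is a minimal, hypothetical point-input VQA model, not the architecture from this paper or its repository. It assumes a toy CNN backbone, a bag-of-words question encoder, and the point supplied as normalized (x, y) coordinates; the answer is predicted from the question embedding together with a global image feature and the local feature sampled under the point.

```python
# Hypothetical sketch (not the paper's model): answer a point-input visual question
# by fusing a question embedding, a global image feature, and the backbone feature
# sampled at the pointed location.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointVQAModel(nn.Module):
    def __init__(self, vocab_size, num_answers, q_dim=256, v_dim=512):
        super().__init__()
        self.q_embed = nn.EmbeddingBag(vocab_size, q_dim)   # bag-of-words question encoder
        self.backbone = nn.Sequential(                       # toy CNN in place of a real backbone
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, v_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(q_dim + 2 * v_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, image, question_tokens, point_xy):
        # image: (B, 3, H, W); question_tokens: (B, T) token ids; point_xy: (B, 2) in [0, 1]
        fmap = self.backbone(image)                           # (B, v_dim, h, w)
        global_feat = fmap.mean(dim=(2, 3))                   # global context

        # Local context: bilinearly sample the feature vector under the pointed location.
        grid = point_xy.view(-1, 1, 1, 2) * 2 - 1             # rescale to [-1, 1] for grid_sample
        local_feat = F.grid_sample(fmap, grid, align_corners=False).flatten(1)

        q_feat = self.q_embed(question_tokens)                # (B, q_dim)
        fused = torch.cat([q_feat, global_feat, local_feat], dim=1)
        return self.classifier(fused)                         # logits over candidate answers

model = PointVQAModel(vocab_size=10_000, num_answers=1_000)
logits = model(torch.randn(2, 3, 224, 224),                   # two images
               torch.randint(0, 10_000, (2, 8)),              # two 8-token questions
               torch.rand(2, 2))                              # one point per image
```

Sampling the backbone feature under the point is only one design choice; an alternative often used for spatial prompts is to rasterize the point into an extra heatmap channel of the image input.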
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- From Pixels to Objects: Cubic Visual Attention for Visual Question Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by using the question to attend to the visual areas that are related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task.
Experimental results show that our proposed method significantly outperforms the state of the art.
arXiv Detail & Related papers (2022-06-04T07:03:18Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [0.0]
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems.
We propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer).
We build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions.
arXiv Detail & Related papers (2021-06-29T16:36:34Z)
- Visual Question Answering: which investigated applications? [14.332672914799272]
In VQA, semantic information in the same media must be compared with the semantics implied by a question expressed in natural language.
This paper considers proposals that focus on real-world applications, possibly using data bound to the application domain as benchmarks.
arXiv Detail & Related papers (2021-03-04T10:38:06Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer (a generic sketch of spatially biased self-attention follows this list).
arXiv Detail & Related papers (2020-07-23T17:20:55Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
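
The spatially aware self-attention layer mentioned in the TextVQA entry above is only named, not specified, in the summary. The following is a generic sketch of one common way to make self-attention over image regions spatially aware: an additive, learned per-head bias computed from pairwise offsets between region centers. The class name SpatialSelfAttention, the input shapes, and the offset MLP are illustrative assumptions, not the layer from that paper.

```python
# Hypothetical sketch: self-attention over N region features, with attention scores
# biased by a learned function of pairwise spatial offsets (not the paper's exact layer).
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Small MLP turning a pairwise (dx, dy) offset into one bias value per head.
        self.spatial_bias = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, num_heads))

    def forward(self, feats, centers):
        # feats: (B, N, dim) region features; centers: (B, N, 2) normalized box centers.
        B, N, _ = feats.shape
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.d).transpose(1, 2)            # (B, h, N, d)
        k = k.view(B, N, self.h, self.d).transpose(1, 2)
        v = v.view(B, N, self.h, self.d).transpose(1, 2)

        offsets = centers[:, :, None, :] - centers[:, None, :, :]   # (B, N, N, 2) pairwise offsets
        bias = self.spatial_bias(offsets).permute(0, 3, 1, 2)       # (B, h, N, N)

        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5 + bias     # content score + spatial bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, self.h * self.d)
        return self.proj(out)

layer = SpatialSelfAttention()
regions = torch.randn(2, 36, 256)   # e.g., 36 detected regions per image
centers = torch.rand(2, 36, 2)      # normalized (x, y) box centers
out = layer(regions, centers)       # (2, 36, 256)
```

The same bias term could instead be driven by discretized relative-position or relation labels; the key point is that the attention scores receive an additive term that depends on geometry, not only on content.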