Visuo-Linguistic Question Answering (VLQA) Challenge
- URL: http://arxiv.org/abs/2005.00330v3
- Date: Wed, 18 Nov 2020 07:45:20 GMT
- Title: Visuo-Linguistic Question Answering (VLQA) Challenge
- Authors: Shailaja Keyur Sampat, Yezhou Yang and Chitta Baral
- Abstract summary: We propose a novel task to derive joint inference about a given image-text modality.
We compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting.
- Score: 47.54738740910987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding images and text together is an important aspect of cognition
and building advanced Artificial Intelligence (AI) systems. As a community, we
have achieved good benchmarks over language and vision domains separately;
however, joint reasoning is still a challenge for state-of-the-art computer
vision and natural language processing (NLP) systems. We propose a novel task
to derive joint inference about a given image-text modality and compile the
Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question
answering setting. Each dataset item consists of an image and a reading
passage, where questions are designed to combine both visual and textual
information, i.e., ignoring either modality would make the question
unanswerable. We first explore the best existing vision-language architectures
to solve VLQA subsets and show that they are unable to reason well. We then
develop a modular method with slightly better baseline performance, but it is
still far behind human performance. We believe that VLQA will be a good
benchmark for reasoning over a visuo-linguistic context. The dataset, code, and
leaderboard are available at https://shailaja183.github.io/vlqa/.
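To make the item structure described above concrete, the following is a minimal sketch of how a VLQA-style example could be represented in code; the field names, the multiple-choice format, and the sanity-check helper are illustrative assumptions, not the released schema.

```python
# Minimal sketch (illustrative, not the released VLQA schema): each item pairs
# an image with a reading passage, and the question should become unanswerable
# if either modality is dropped.
from dataclasses import dataclass
from typing import List


@dataclass
class VLQAItem:
    image_path: str       # visual modality
    passage: str          # textual modality (reading passage)
    question: str
    choices: List[str]    # assumed multiple-choice answer options
    answer_index: int     # index of the correct choice


def requires_both_modalities(acc_full: float, acc_image_only: float,
                             acc_text_only: float, chance: float) -> bool:
    """Hypothetical sanity check: on well-constructed items, single-modality
    baselines should stay near chance while the full model does better."""
    return acc_full > chance and max(acc_image_only, acc_text_only) <= chance + 0.05
```

A data loader built on such a structure would feed both the image and the passage to a vision-language model, since neither modality alone suffices by design.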
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z) - Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in such datasets focus more on the text present in the image.
Models trained on this data therefore predict biased answers, reflecting a lack of visual-context understanding.
arXiv Detail & Related papers (2023-08-01T05:28:13Z) - UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z) - Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) that simultaneously handles VQA cases with and without the need to understand text in the input images (a minimal routing sketch follows this list).
In the branch that involves text understanding, we incorporate the Optical Character Recognition (OCR) features into the model to help understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account multi-modal information, including text, layout, and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z) - CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question
Answering with Hypothetical Actions over Images [31.317663183139384]
We take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario.
We formulate a vision-language question answering task based on the CLEVR dataset.
arXiv Detail & Related papers (2021-04-13T07:29:21Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
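As noted in the Text-Aware Dual Routing Network entry above, the sketch below illustrates the general dual-routing idea under stated assumptions: a learned gate decides how much an OCR-aware branch should contribute for a given question. Module names, feature sizes, and the soft gating rule are hypothetical and not taken from the paper.

```python
# Minimal sketch of a dual-routing VQA forward pass, loosely inspired by the
# TDR entry above. Module names, feature sizes, and the soft routing rule are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class DualRoutingVQA(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3000):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # does the question need scene text?
        self.visual_branch = nn.Linear(2 * dim, dim)                  # question + image features
        self.text_branch = nn.Linear(3 * dim, dim)                    # question + image + OCR features
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, q_feat, img_feat, ocr_feat):
        # Route softly between the OCR-aware and OCR-free branches.
        gate = self.router(q_feat)                                    # shape: (batch, 1)
        no_text = torch.relu(self.visual_branch(torch.cat([q_feat, img_feat], dim=-1)))
        with_text = torch.relu(self.text_branch(torch.cat([q_feat, img_feat, ocr_feat], dim=-1)))
        fused = gate * with_text + (1 - gate) * no_text
        return self.classifier(fused)


# Usage with random features (batch of 2):
model = DualRoutingVQA()
logits = model(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```

A hard router (e.g., thresholding the gate) would be an alternative design; the soft mixture shown here keeps the module differentiable end to end.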
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.