Achieving Human Parity on Visual Question Answering
- URL: http://arxiv.org/abs/2111.08896v3
- Date: Fri, 19 Nov 2021 07:22:08 GMT
- Title: Achieving Human Parity on Visual Question Answering
- Authors: Ming Yan, Haiyang Xu, Chenliang Li, Junfeng Tian, Bin Bi, Wei Wang,
Weihua Chen, Xianzhe Xu, Fan Wang, Zheng Cao, Zhicheng Zhang, Qiyu Zhang, Ji
Zhang, Songfang Huang, Fei Huang, Luo Si, Rong Jin
- Abstract summary: The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than human beings do on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
- Score: 67.22500027651509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Visual Question Answering (VQA) task utilizes both visual image and
language analysis to answer a textual question with respect to an image. It has
been a popular research topic with an increasing number of real-world
applications in the last decade. This paper describes our recent research on
AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine
IntelligeNce lab of Damo academy - MultiMedia Understanding), which obtains
similar or even slightly better results than human beings do on VQA. This is
achieved by systematically improving the VQA pipeline including: (1)
pre-training with comprehensive visual and textual feature representation; (2)
effective cross-modal interaction with learning to attend; and (3) a novel
knowledge mining framework with specialized expert modules for the complex VQA
task. Treating different types of visual questions with the corresponding
expertise plays an important role in boosting the performance of our VQA
architecture up to the human level. An extensive set of experiments and
analyses is conducted to demonstrate the effectiveness of the new research
work.
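The three-part pipeline described in the abstract pairs cross-modal attention ("learning to attend") with specialized expert modules for different question types. As a rough illustration only, the minimal PyTorch sketch below wires those two ideas together; every module name, dimension, and the question-type taxonomy is invented here and is not taken from AliceMind-MMU.

```python
# Hypothetical sketch of the two ideas named in the abstract: cross-modal
# attention ("learning to attend") and routing to question-type experts.
# All names and sizes are invented; this is NOT the AliceMind-MMU code.
import torch
import torch.nn as nn


class CrossModalExpertVQA(nn.Module):
    def __init__(self, dim=768, num_heads=12, num_question_types=3, num_answers=3129):
        super().__init__()
        # Text tokens attend over detected image regions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Lightweight router that predicts the question type (e.g. counting,
        # text reading, commonsense) from the pooled question representation.
        self.router = nn.Linear(dim, num_question_types)
        # One specialized "expert" answer head per question type.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_answers))
             for _ in range(num_question_types)]
        )

    def forward(self, text_feats, region_feats):
        # text_feats:   (B, T, dim) contextual question-token features
        # region_feats: (B, R, dim) visual features of detected object regions
        fused, _ = self.cross_attn(text_feats, region_feats, region_feats)
        pooled = fused.mean(dim=1)                      # (B, dim)
        gate = self.router(pooled).softmax(dim=-1)      # (B, num_question_types)
        expert_logits = torch.stack([e(pooled) for e in self.experts], dim=1)
        # Soft mixture of expert predictions, weighted by the router.
        return (gate.unsqueeze(-1) * expert_logits).sum(dim=1)  # (B, num_answers)


if __name__ == "__main__":
    model = CrossModalExpertVQA()
    logits = model(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
    print(logits.shape)  # torch.Size([2, 3129])
```

The router above mixes the expert heads softly; a hard, per-question-type dispatch would be an equally plausible reading of the abstract.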
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
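The VQA$^2$ entry above mentions interleaving visual and motion tokens to capture spatial-temporal quality details. A small, hypothetical illustration of chunk-wise interleaving follows; the chunk size, shapes, and encoders are placeholders rather than the actual VQA$^2$ design.

```python
# Hypothetical illustration of interleaving visual and motion token sequences;
# the chunk size and tensors are placeholders, not the VQA^2 models themselves.
import torch


def interleave_tokens(visual_tokens, motion_tokens, chunk=4):
    """Alternate fixed-size chunks of visual and motion tokens along the
    sequence dimension: [V V V V M M M M V V V V ...]."""
    v_chunks = visual_tokens.split(chunk, dim=1)
    m_chunks = motion_tokens.split(chunk, dim=1)
    mixed = []
    for v, m in zip(v_chunks, m_chunks):
        mixed.extend([v, m])
    # Append any leftover chunks from the longer sequence.
    longer = v_chunks if len(v_chunks) > len(m_chunks) else m_chunks
    mixed.extend(longer[min(len(v_chunks), len(m_chunks)):])
    return torch.cat(mixed, dim=1)


if __name__ == "__main__":
    visual = torch.randn(2, 16, 768)   # per-frame appearance tokens
    motion = torch.randn(2, 8, 768)    # motion/temporal tokens
    print(interleave_tokens(visual, motion).shape)  # torch.Size([2, 24, 768])
```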
- From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities [2.0681376988193843]
The work presents a survey in the domain of Visual Question Answering (VQA) that delves into the intricacies of VQA datasets and methods over the field's history.
We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation.
arXiv Detail & Related papers (2023-11-01T05:39:41Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
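The language-guidance (LG) idea above amounts to enriching the question with auxiliary text such as a caption, a rationale, or a linearized scene graph before it reaches the multimodal language model. The template below is an invented example of such prompt assembly, not the paper's actual prompt format.

```python
# Invented illustration of a knowledge-enriched prompt in the spirit of
# language-guided VQA; the template and field names are not from the paper.
def build_lg_prompt(question, caption=None, rationale=None, scene_graph=None):
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if scene_graph:
        parts.append(f"Scene graph: {scene_graph}")
    if rationale:
        parts.append(f"Rationale: {rationale}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


print(build_lg_prompt(
    question="What sport is being played?",
    caption="A player in white swings a racket on a clay court.",
    scene_graph="(player, holds, racket); (player, on, court)",
    rationale="Rackets and clay courts are associated with tennis.",
))
```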
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
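ReViz is described as encoding a text-plus-image query directly and retrieving from a large corpus end to end. The sketch below shows generic dense retrieval with a joint query embedding under that reading; the projection layers, dimensions, and scoring are stand-ins, not the actual ReViz architecture.

```python
# Stand-in sketch of dense retrieval with a joint text+image query embedding;
# the encoders here are untrained projections, not the real ReViz model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointQueryEncoder(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.image_proj = nn.Linear(image_dim, dim)

    def forward(self, text_feat, image_feat):
        # Fuse the two modalities into a single query vector and normalize it
        # so that dot products correspond to cosine similarity.
        q = self.text_proj(text_feat) + self.image_proj(image_feat)
        return F.normalize(q, dim=-1)


encoder = JointQueryEncoder()
query = encoder(torch.randn(1, 768), torch.randn(1, 2048))   # (1, 512)
corpus = F.normalize(torch.randn(10_000, 512), dim=-1)       # pre-encoded knowledge
scores = query @ corpus.T                                    # (1, 10000)
print(scores.topk(k=5, dim=-1).indices)                      # top retrieved entries
```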
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
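REVIVE's claim is that explicit object-region information helps knowledge-based VQA. The following hypothetical sketch feeds per-region detector features together with their box geometry and the question tokens into a shared encoder; it illustrates the general idea only and does not reproduce REVIVE's pipeline.

```python
# Hypothetical sketch: combine explicit object-region features (visual feature
# plus box geometry) with question tokens; not REVIVE's actual pipeline.
import torch
import torch.nn as nn


class RegionAwareEncoder(nn.Module):
    def __init__(self, dim=768, region_dim=2048, num_layers=2):
        super().__init__()
        # Project detector features and normalized box coordinates into the
        # same space as the question tokens.
        self.region_proj = nn.Linear(region_dim + 4, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, question_tokens, region_feats, region_boxes):
        # question_tokens: (B, T, dim); region_feats: (B, R, region_dim)
        # region_boxes:    (B, R, 4) normalized (x1, y1, x2, y2)
        regions = self.region_proj(torch.cat([region_feats, region_boxes], dim=-1))
        return self.encoder(torch.cat([question_tokens, regions], dim=1))


model = RegionAwareEncoder()
out = model(torch.randn(2, 20, 768), torch.randn(2, 36, 2048), torch.rand(2, 36, 4))
print(out.shape)  # torch.Size([2, 56, 768])
```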
- An experimental study of the vision-bottleneck in VQA [17.132865538874352]
We study the vision-bottleneck in Visual Question Answering (VQA).
We experiment with both the quantity and quality of visual objects extracted from images.
We also study the impact of two methods to incorporate the information about objects necessary for answering a question.
arXiv Detail & Related papers (2022-02-14T16:43:32Z)
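One axis studied in the vision-bottleneck paper is the quantity of visual objects given to the VQA model. A trivial illustration of that knob, keeping only the top-k detections by confidence, is sketched below with placeholder shapes and scores.

```python
# Simple illustration of varying the *quantity* of visual objects by keeping
# only the top-k detections; shapes and scores are placeholders.
import torch


def keep_top_k_objects(region_feats, det_scores, k=36):
    """region_feats: (R, D) detector features; det_scores: (R,) confidences."""
    k = min(k, det_scores.numel())
    idx = det_scores.topk(k).indices
    return region_feats[idx]


feats = torch.randn(100, 2048)
scores = torch.rand(100)
for k in (10, 36, 100):
    print(k, keep_top_k_objects(feats, scores, k).shape)
```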
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
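MuMuQA couples cross-media grounding (resolving which visual object the question refers to) with extractive QA over the news body. The toy sketch below covers only the second half, predicting an answer span from a tokenized question-plus-body sequence with an untrained stand-in model; it is not the benchmark's baseline system.

```python
# Toy sketch of the span-prediction half of the MuMuQA task; the model is an
# untrained stand-in, not the benchmark's baseline system.
import torch
import torch.nn as nn


class SpanPredictor(nn.Module):
    def __init__(self, vocab=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.span_head = nn.Linear(dim, 2)   # start / end logits per token

    def forward(self, token_ids):
        hidden = self.embed(token_ids)                               # (B, T, dim)
        start_logits, end_logits = self.span_head(hidden).unbind(dim=-1)
        return start_logits.argmax(-1), end_logits.argmax(-1)


body_tokens = torch.randint(0, 30522, (1, 128))   # tokenized question + news body
start, end = SpanPredictor()(body_tokens)
print(int(start), int(end))                       # predicted answer span indices
```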
- Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser [43.42833961578857]
We propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns an entity-based questioning strategy from human dialogs.
We also propose an Augmented Guesser (AugG) that is strong and specifically optimized for the visual dialog (VD) setting.
Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-the-art performance on both the image-guessing task and question diversity.
arXiv Detail & Related papers (2021-09-06T08:58:43Z)
- Component Analysis for Visual Question Answering Architectures [10.56011196733086]
The main goal of this paper is to provide a comprehensive analysis regarding the impact of each component in Visual Question Answering models.
Our major contribution is to identify core components for training VQA models so as to maximize their predictive performance.
arXiv Detail & Related papers (2020-02-12T17:25:50Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between model complexity and performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
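The last entry studies how much the multi-modal fusion step costs relative to what it buys. The generic snippet below contrasts three common fusion operators at increasing parameter cost (concatenation, element-wise product, bilinear fusion); these are textbook operators used for illustration, not the specific models compared in the paper.

```python
# Generic illustration of multi-modal fusion operators of increasing cost;
# not the exact models compared in the accuracy-vs-complexity study.
import torch
import torch.nn as nn

dim = 256
img = torch.randn(8, dim)   # pooled image features
txt = torch.randn(8, dim)   # pooled question features

# 1) Concatenation + linear projection: cheap, O(2 * dim^2) parameters.
concat_fuse = nn.Linear(2 * dim, dim)
fused_concat = concat_fuse(torch.cat([img, txt], dim=-1))

# 2) Element-wise (Hadamard) product: essentially free, no extra parameters.
fused_product = img * txt

# 3) Full bilinear fusion: most expressive but O(dim^3) parameters.
bilinear_fuse = nn.Bilinear(dim, dim, dim)
fused_bilinear = bilinear_fuse(img, txt)

for name, t in [("concat", fused_concat), ("product", fused_product),
                ("bilinear", fused_bilinear)]:
    print(name, t.shape)
```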