From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
- URL: http://arxiv.org/abs/2311.00308v1
- Date: Wed, 1 Nov 2023 05:39:41 GMT
- Title: From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
- Authors: Md Farhan Ishmam, Md Sakib Hossain Shovon, M.F. Mridha, Nilanjan Dey
- Abstract summary: The work presents a survey in the domain of Visual Question Answering (VQA) that delves into the intricacies of VQA datasets and methods over the field's history.
We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The multimodal task of Visual Question Answering (VQA), encompassing elements
of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate
answers to questions on any visual input. Over time, the scope of VQA has
expanded from datasets focusing on an extensive collection of natural images to
datasets featuring synthetic images, video, 3D environments, and various other
visual inputs. The emergence of large pre-trained networks has shifted the
early VQA approaches relying on feature extraction and fusion schemes to vision
language pre-training (VLP) techniques. However, there is a lack of
comprehensive surveys that encompass both traditional VQA architectures and
contemporary VLP-based methods. Furthermore, VLP challenges viewed through the lens of
VQA have not been thoroughly explored, leaving room for potential open problems
to emerge. Our work presents a survey in the domain of VQA that delves into the
intricacies of VQA datasets and methods over the field's history, introduces a
detailed taxonomy to categorize the facets of VQA, and highlights the recent
trends, challenges, and scopes for improvement. We further generalize VQA to
multimodal question answering, explore tasks related to VQA, and present a set
of open problems for future investigation. The work aims to navigate both
beginners and experts by shedding light on the potential avenues of research
and expanding the boundaries of the field.
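The shift the abstract describes, from feature extraction and fusion schemes to VLP, can be illustrated with a toy sketch of the classic joint-embedding baseline: pre-extracted image and question feature vectors (random stand-ins here for CNN and LSTM outputs) are fused by element-wise product and scored against a small answer vocabulary. All names, dimensions, and the answer set are hypothetical; this is a minimal illustration of the general fusion idea, not any specific model surveyed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for extracted features: in early VQA systems
# these would come from a CNN (image) and an RNN/LSTM (question).
image_feat = rng.standard_normal(512)     # visual embedding
question_feat = rng.standard_normal(512)  # textual embedding

# Element-wise (Hadamard) product fusion, one common early scheme.
fused = image_feat * question_feat

# Linear classifier over a tiny hypothetical answer vocabulary;
# VQA is commonly framed as classification over frequent answers.
answers = ["yes", "no", "2", "red", "dog"]
W = rng.standard_normal((len(answers), fused.size)) * 0.01
logits = W @ fused

# Softmax turns answer scores into a distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted = answers[int(np.argmax(probs))]
print(predicted)
```

VLP-based methods replace the hand-designed fusion step with a jointly pre-trained vision-language transformer, but the output head is often still a classifier of this shape.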
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets [5.45761450227064]
We propose a new Few-Shot Visual Question Generation (FS-VQG) task and provide a comprehensive benchmark for it.
We evaluate various existing VQG approaches as well as popular few-shot solutions based on meta-learning and self-supervised strategies for the FS-VQG task.
Several important findings emerge from our experiments that shed light on the limits of current models in few-shot vision and language generation tasks.
arXiv Detail & Related papers (2022-10-13T15:01:15Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
- An experimental study of the vision-bottleneck in VQA [17.132865538874352]
We study the vision-bottleneck in Visual Question Answering (VQA).
We experiment with both the quantity and quality of visual objects extracted from images.
We also study the impact of two methods to incorporate the information about objects necessary for answering a question.
arXiv Detail & Related papers (2022-02-14T16:43:32Z)
- Medical Visual Question Answering: A Survey [55.53205317089564]
Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges.
Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer.
arXiv Detail & Related papers (2021-11-19T05:55:15Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains results similar to or even slightly better than those of human beings on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- A survey on VQA Datasets and Approaches [0.0]
Visual question answering (VQA) is a task that combines the techniques of computer vision and natural language processing.
This paper will review and analyze existing datasets, metrics, and models proposed for the VQA task.
arXiv Detail & Related papers (2021-05-02T08:50:30Z)
- Recent Advances in Video Question Answering: A Review of Datasets and Methods [0.0]
VideoQA helps to retrieve temporal and spatial information from video scenes and interpret it.
To the best of our knowledge, no previous survey has been conducted for the VideoQA task.
arXiv Detail & Related papers (2021-01-15T03:26:24Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.