The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering
- URL: http://arxiv.org/abs/2501.07109v1
- Date: Mon, 13 Jan 2025 07:43:33 GMT
- Title: The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering
- Authors: Anupam Pandey, Deepjyoti Bodo, Arpan Phukan, Asif Ekbal
- Abstract summary: Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP).
Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models.
This survey traces the journey of VQA from its early days, through major breakthroughs, such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods.
- Score: 17.43904098033175
- Abstract: Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models. This survey traces the journey of VQA from its early days, through major breakthroughs, such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. We highlight key models, datasets, and techniques that shaped the development of VQA systems, emphasizing the pivotal role of transformer architectures and multimodal pre-training in driving recent progress. Additionally, we explore specialized applications of VQA in domains like healthcare and discuss ongoing challenges, such as dataset bias, model interpretability, and the need for common-sense reasoning. Lastly, we discuss the emerging trends in large multimodal language models and the integration of external knowledge, offering insights into the future directions of VQA. This paper aims to provide a comprehensive overview of the evolution of VQA, highlighting both its current state and potential advancements.
Related papers
- Visual question answering: from early developments to recent advances -- a survey [11.729464930866483]
Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content.
VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning.
arXiv Detail & Related papers (2025-01-07T17:00:35Z)
- Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey [17.33078069581465]
Visual Question Answering (VQA) is a challenging task that combines natural language processing and computer vision techniques.
This survey gives an up-to-date synthesis of natural language understanding of images and text.
arXiv Detail & Related papers (2024-11-26T16:21:03Z)
- A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Vision-language-action models (VLAs) have become a foundational element in robot learning.
Various methods have been proposed to enhance traits such as versatility, dexterity, and generalizability.
VLAs serve as high-level task planners capable of decomposing long-horizon tasks into executable subtasks.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
- From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities [2.0681376988193843]
The work presents a survey in the domain of Visual Question Answering (VQA) that delves into the intricacies of VQA datasets and methods over the field's history.
We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation.
arXiv Detail & Related papers (2023-11-01T05:39:41Z)
- VQA and Visual Reasoning: An Overview of Recent Datasets, Methods and Challenges [1.565870461096057]
The integration of vision and language has sparked a lot of attention as a result of this.
The tasks have been created in such a way that they properly exemplify the concepts of deep learning.
arXiv Detail & Related papers (2022-12-26T20:56:01Z)
- Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions [68.6358773622615]
This paper provides an overview of the computational and theoretical foundations of multimodal machine learning.
We propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification.
Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches.
arXiv Detail & Related papers (2022-09-07T19:21:19Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.
An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
arXiv Detail & Related papers (2021-12-16T04:37:10Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains results similar to or even slightly better than those of human beings on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- A survey on VQA_Datasets and Approaches [0.0]
Visual question answering (VQA) is a task that combines the techniques of computer vision and natural language processing.
This paper will review and analyze existing datasets, metrics, and models proposed for the VQA task.
arXiv Detail & Related papers (2021-05-02T08:50:30Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)