Robust Visual Question Answering: Datasets, Methods, and Future Challenges
- URL: http://arxiv.org/abs/2307.11471v2
- Date: Sun, 18 Feb 2024 08:00:19 GMT
- Title: Robust Visual Question Answering: Datasets, Methods, and Future Challenges
- Authors: Jie Ma, Pinghui Wang, Dechen Kong, Zewei Wang, Jun Liu, Hongbin Pei,
Junzhou Zhao
- Abstract summary: Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question.
Previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers.
Various datasets and debiasing methods have been proposed to evaluate and enhance VQA robustness, respectively.
- Score: 23.59923999144776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering requires a system to provide an accurate natural
language answer given an image and a natural language question. However, it is
widely recognized that previous generic VQA methods often exhibit a tendency to
memorize biases present in the training data rather than learning proper
behaviors, such as grounding images before predicting answers. Therefore, these
methods usually achieve high in-distribution but poor out-of-distribution
performance. In recent years, various datasets and debiasing methods have been
proposed to evaluate and enhance VQA robustness, respectively. This paper
provides the first comprehensive survey focused on this emerging research area.
Specifically, we first provide an overview of the development process of
datasets from in-distribution and out-of-distribution perspectives. Then, we
examine the evaluation metrics employed by these datasets. Thirdly, we propose
a typology that presents the development process, similarities and differences,
robustness comparison, and technical features of existing debiasing methods.
Furthermore, we analyze and discuss the robustness of representative
vision-and-language pre-training models on VQA. Finally, through a thorough
review of the available literature and experimental analysis, we discuss the
key areas for future research from various viewpoints.
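The gap between in-distribution (ID) and out-of-distribution (OOD) accuracy that the abstract describes is the usual way robustness is quantified: a model that memorizes training biases scores well on an ID split but degrades sharply on an OOD split such as VQA-CP. A minimal sketch of that comparison (all answers and predictions below are hypothetical, not taken from the survey):

```python
# Illustrative sketch: summarize robustness as the ID-vs-OOD accuracy gap.
# All data below is made up for demonstration purposes only.

def accuracy(predictions, answers):
    """Fraction of questions answered correctly (exact string match)."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical predictions from a biased model on two test splits.
id_preds,  id_gold  = ["red", "2", "yes", "dog"], ["red", "2", "yes", "cat"]
ood_preds, ood_gold = ["yes", "2", "yes", "dog"], ["no", "4", "yes", "cat"]

id_acc  = accuracy(id_preds, id_gold)    # 0.75
ood_acc = accuracy(ood_preds, ood_gold)  # 0.25
gap = id_acc - ood_acc                   # a large gap suggests memorized biases

print(f"ID: {id_acc:.2f}  OOD: {ood_acc:.2f}  gap: {gap:.2f}")
```

Debiasing methods surveyed in the paper aim to shrink this gap without sacrificing ID accuracy.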
Related papers
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- How to Determine the Most Powerful Pre-trained Language Model without Brute Force Fine-tuning? An Empirical Survey [23.757740341834126]
We show that H-Score generally performs well with superiorities in effectiveness and efficiency.
We also outline the difficulties of consideration of training details, applicability to text generation, and consistency to certain metrics which shed light on future directions.
arXiv Detail & Related papers (2023-12-08T01:17:28Z)
- The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation [32.7348470366509]
The goal of RSVQA is to answer a question formulated in natural language about a remote sensing image.
The problem of language biases is often overlooked in the remote sensing community.
The present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy.
arXiv Detail & Related papers (2023-11-28T13:45:15Z)
- Out-of-Distribution Generalization in Text Classification: Past, Present, and Future [30.581612475530974]
Machine learning (ML) systems in natural language processing (NLP) face significant challenges in generalizing to out-of-distribution (OOD) data.
This poses important questions about the robustness of NLP models and their high accuracy, which may be artificially inflated due to their underlying sensitivity to systematic biases.
This paper presents the first comprehensive review of recent progress, methods, and evaluations on this topic.
arXiv Detail & Related papers (2023-05-23T14:26:11Z)
- An Empirical Study on the Language Modal in Visual Question Answering [31.692905677913068]
Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain.
This paper attempts to provide new insights into the influence of language modality on VQA performance.
arXiv Detail & Related papers (2023-05-17T11:56:40Z)
- Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z)
- Language bias in Visual Question Answering: A Survey and Taxonomy [0.0]
We conduct a comprehensive review and analysis of this field for the first time.
We classify the existing methods into three categories, including methods that enhance visual information.
The causes of language bias are revealed and classified.
arXiv Detail & Related papers (2021-11-16T15:01:24Z)
- Introspective Distillation for Robust Question Answering [70.18644911309468]
Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension.
Recent debiasing methods achieve good out-of-distribution (OOD) generalizability at the cost of a considerable drop in in-distribution (ID) performance.
We present a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA.
arXiv Detail & Related papers (2021-11-01T15:30:15Z)
- Deep Learning Schema-based Event Extraction: Literature Review and Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot.
This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models.
arXiv Detail & Related papers (2021-07-05T16:32:45Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.