On the Significance of Question Encoder Sequence Model in the
Out-of-Distribution Performance in Visual Question Answering
- URL: http://arxiv.org/abs/2108.12585v1
- Date: Sat, 28 Aug 2021 05:51:27 GMT
- Title: On the Significance of Question Encoder Sequence Model in the
Out-of-Distribution Performance in Visual Question Answering
- Authors: Gouthaman KV, Anurag Mittal
- Abstract summary: Generalizing beyond seen experiences plays a significant role in developing practical AI systems.
Current Visual Question Answering (VQA) models are over-dependent on language priors.
This paper shows that the sequence model architecture used in the question encoder has a significant role in the generalizability of VQA models.
- Score: 15.787663289343948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalizing beyond seen experiences plays a significant role in developing
practical AI systems. It has been shown that current Visual Question Answering
(VQA) models are over-dependent on language priors (spurious correlations
between question types and their most frequent answers) from the training set and
perform poorly on Out-of-Distribution (OOD) test sets. This behavior
limits their generalizability and restricts them from being used in
real-world situations. This paper shows that the sequence model architecture
used in the question encoder has a significant role in the generalizability of
VQA models. To demonstrate this, we performed a detailed analysis of various
existing RNN-based and Transformer-based question encoders and, in addition,
proposed a novel Graph Attention Network (GAT)-based question encoder. Our
study found that a better choice of sequence model in the question encoder
improves the generalizability of VQA models even without any additional,
relatively complex bias-mitigation approaches.
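The GAT-based question encoder treats the question tokens as nodes of a graph and aggregates them through attention. Below is a minimal, dependency-free sketch of that idea, using a single attention head, a fully connected token graph, and toy deterministic embeddings; these are all illustrative simplifications, not the paper's actual architecture.

```python
import math

def graph_attention_encode(tokens, dim=4):
    """Toy single-head graph-attention encoder over question tokens.

    Every token attends to every token (fully connected graph); attention
    logits are dot products of token embeddings, and the final question
    encoding is the mean of the attended node features.
    """
    def embed(tok):
        # Hypothetical deterministic embedding, just for the sketch.
        base = sum(ord(c) for c in tok)
        return [((base * (i + 1)) % 97) / 97.0 for i in range(dim)]

    vecs = [embed(t) for t in tokens]

    attended = []
    for vi in vecs:
        # Unnormalized attention logits: dot(v_i, v_j) for every node j.
        logits = [sum(a * b for a, b in zip(vi, vj)) for vj in vecs]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Attention-weighted aggregation of neighbour features.
        attended.append(
            [sum(w * vj[d] for w, vj in zip(weights, vecs)) for d in range(dim)]
        )

    # Mean-pool node features into a single question vector.
    return [sum(v[d] for v in attended) / len(attended) for d in range(dim)]
```

A real implementation would use learned embeddings and learned attention parameters; the sketch only shows the message-passing shape of the encoder.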
Related papers
- QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems [3.486120902611884]
This paper explores the significance of different question types for VQA systems and their impact on performance.
We propose QTG-VQA, a novel architecture that incorporates question-type-guided attention and an adaptive learning mechanism.
arXiv Detail & Related papers (2024-09-14T07:42:41Z)
- Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture [9.244633039170186]
We propose a plug-and-play Entropy-based Scaling Factor (ESF) and a Distribution-Specific (DS) decoder.
ESF adjusts the attention weight pattern of the model towards familiar ones discovered during training when solving VRPs of varying sizes.
DS decoder explicitly models VRPs of multiple training distribution patterns through multiple auxiliary light decoders, expanding the model representation space.
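As summarized above, ESF nudges attention-weight patterns toward those seen during training. One loose reading of that idea is to rescale the attention logits until the distribution's entropy matches a training-time target. The sketch below implements that reading with a bisection search over the scaling factor; the function and search scheme are illustrative assumptions, not the paper's actual method.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_scaled_attention(logits, target_entropy, iters=50):
    """Scale attention logits so the softmax entropy matches a target.

    A larger scaling factor flattens the distribution (higher entropy),
    so a geometric bisection over the factor converges on the target.
    """
    lo, hi = 1e-3, 1e3
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        h = entropy(softmax([l / mid for l in logits]))
        if h < target_entropy:
            lo = mid  # too peaked: increase the factor
        else:
            hi = mid  # too flat: decrease the factor
    return softmax([l / mid for l in logits])
```

For four logits the entropy can range from near 0 up to ln 4 ≈ 1.386, so any target inside that interval is reachable.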
arXiv Detail & Related papers (2024-06-10T09:03:17Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- Logical Implications for Visual Question Answering Consistency [2.005299372367689]
We introduce a new consistency loss term that can be used by a wide range of VQA models.
We propose to infer logical relations between question-answer pairs using a dedicated language model and use them in our proposed consistency loss function.
We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models.
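A consistency loss of this kind penalizes predictions that violate a logical implication between two questions. The toy penalty below is one way to express that: where the first answer logically implies the second, the model should not be more confident in the premise than in the conclusion. This is an illustration of the general idea, not the paper's actual loss function.

```python
def implication_consistency_loss(pairs):
    """Average hinge penalty over (p, q) probability pairs where the
    first answer implies the second: any confidence in the premise
    beyond the confidence in the conclusion is penalized."""
    return sum(max(0.0, p - q) for p, q in pairs) / len(pairs)
```

For example, being 90% sure "the lesion is severe" while only 40% sure "a lesion is present" violates the implication and is penalized; the reverse ordering contributes zero loss.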
arXiv Detail & Related papers (2023-03-16T16:00:18Z)
- Attention-guided Generative Models for Extractive Question Answering [17.476450946279037]
Recently, pretrained generative sequence-to-sequence (seq2seq) models have achieved great success in question answering.
We propose a simple strategy to obtain an extractive answer span from the generative model by leveraging the decoder cross-attention patterns.
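One simplified version of this strategy: for each generated answer token, take the source position that receives the most cross-attention, then return the tightest source span covering those positions. The function below sketches that reading; the paper's exact procedure may differ.

```python
def span_from_cross_attention(cross_attn):
    """Derive an extractive answer span from decoder cross-attention.

    cross_attn: one row per generated answer token, each row holding
    attention weights over the source tokens. Returns (start, end)
    indices of the tightest source span covering the per-token argmaxes.
    """
    positions = [max(range(len(row)), key=row.__getitem__) for row in cross_attn]
    return min(positions), max(positions)
```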
arXiv Detail & Related papers (2021-10-12T23:02:35Z)
- X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering [49.36818290978525]
Recompositions of existing visual concepts can generate compositions unseen in the training set.
We propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly.
The baseline VQA model trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks.
arXiv Detail & Related papers (2021-07-24T10:17:48Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models that is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.