Component Analysis for Visual Question Answering Architectures
- URL: http://arxiv.org/abs/2002.05104v2
- Date: Fri, 27 Mar 2020 01:08:38 GMT
- Title: Component Analysis for Visual Question Answering Architectures
- Authors: Camila Kolling, Jônatas Wehrmann, and Rodrigo C. Barros
- Abstract summary: The main goal of this paper is to provide a comprehensive analysis regarding the impact of each component in Visual Question Answering models.
Our major contribution is to identify core components for training VQA models so as to maximize their predictive performance.
- Score: 10.56011196733086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research advances in Computer Vision and Natural Language Processing
have introduced novel tasks that are paving the way for solving AI-complete
problems. One of those tasks is called Visual Question Answering (VQA). A VQA
system must take an image and a free-form, open-ended natural language question
about the image, and produce a natural language answer as the output. Such a
task has drawn great attention from the scientific community, which generated a
plethora of approaches that aim to improve the VQA predictive accuracy. Most of
them comprise three major components: (i) independent representation learning
of images and questions; (ii) feature fusion so the model can use information
from both sources to answer visual questions; and (iii) the generation of the
correct answer in natural language. With so many approaches recently
introduced, the real contribution of each component to the ultimate
performance of the model has become unclear. The main goal of this paper is to
provide a comprehensive analysis of the impact of each component in VQA
models. Our extensive set of experiments covers both visual and textual
elements, as well as the combination of these representations in the form of
fusion and
attention mechanisms. Our major contribution is to identify core components for
training VQA models so as to maximize their predictive performance.
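To make the three components above concrete, here is a minimal PyTorch sketch of a generic VQA classifier of the kind this paper analyzes, assuming a ResNet-18 image encoder, a GRU question encoder, element-wise-product fusion, and classification over a fixed answer vocabulary; the layer choices, dimensions, and the `SimpleVQA` name are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class SimpleVQA(nn.Module):
    """Generic three-component VQA model: (i) independent image and question
    encoders, (ii) feature fusion, (iii) answer prediction over a fixed
    vocabulary of frequent answers. Illustrative sketch only."""

    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=1024):
        super().__init__()
        # (i) image encoder: a CNN backbone with its classification head removed
        cnn = models.resnet18(weights=None)  # load pretrained weights in practice
        self.image_encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.image_proj = nn.Linear(512, hidden_dim)
        # (i) question encoder: word embeddings followed by a GRU
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # (iii) answer prediction treated as classification
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, images, questions):
        v = self.image_proj(self.image_encoder(images).flatten(1))  # [B, hidden_dim]
        _, h = self.gru(self.embedding(questions))                  # h: [1, B, hidden_dim]
        q = h.squeeze(0)
        # (ii) feature fusion: element-wise product of the two modalities
        fused = torch.tanh(v) * torch.tanh(q)
        return self.classifier(fused)                               # answer logits

# toy forward pass with random inputs (hypothetical sizes)
model = SimpleVQA(vocab_size=10000, num_answers=3000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(1, 10000, (2, 14)))
```

Swapping the backbone, the text encoder, or the fusion operator in this skeleton is exactly the kind of component substitution whose impact the paper measures.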
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models [36.56689822791777]
Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions.
Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method.
Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets.
arXiv Detail & Related papers (2024-06-14T13:07:46Z)
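As a hedged illustration of question enrichment with knowledge-graph triples (the dynamic triple extraction step itself is not described in the summary above), the snippet below simply verbalizes a few retrieved triples and prepends them to the question text; the example triples and the `enrich_question` helper are hypothetical.

```python
def enrich_question(question, triples, max_triples=3):
    """Prepend verbalized (subject, predicate, object) triples to a question.
    The triples would come from a retrieval step over a knowledge graph
    (mocked here); this is a generic illustration, not the paper's method."""
    facts = ". ".join(f"{s} {p} {o}" for s, p, o in triples[:max_triples])
    return f"{facts}. {question}" if facts else question

# hypothetical triples retrieved for an image of a well-known landmark
triples = [("Eiffel Tower", "located in", "Paris"),
           ("Paris", "capital of", "France")]
print(enrich_question("Which country is this landmark in?", triples))
```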
- Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task because the multi-modal inputs come from two different feature spaces.
We propose the Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than human beings do on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representations; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- How to find a good image-text embedding for remote sensing visual question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone.
We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
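The summary above does not name the three fusion methodologies, so the sketch below only illustrates three fusion operators commonly compared in VQA work (concatenation plus projection, element-wise product, and a low-rank bilinear form); treat them as PyTorch stand-ins rather than the strategies actually evaluated in that paper.

```python
import torch
import torch.nn as nn

class FusionBaselines(nn.Module):
    """Three common ways to fuse an image vector v and a question vector q.
    Illustrative stand-ins; not necessarily the three methods of the paper."""

    def __init__(self, dim=1024, rank=256):
        super().__init__()
        self.concat_proj = nn.Linear(2 * dim, dim)  # (a) concatenation + projection
        self.u = nn.Linear(dim, rank)               # (c) low-rank bilinear factors
        self.w = nn.Linear(dim, rank)
        self.out = nn.Linear(rank, dim)

    def concat(self, v, q):
        return torch.relu(self.concat_proj(torch.cat([v, q], dim=-1)))

    def hadamard(self, v, q):                       # (b) element-wise product
        return torch.tanh(v) * torch.tanh(q)

    def bilinear_lowrank(self, v, q):
        return self.out(self.u(v) * self.w(q))

fusion = FusionBaselines()
v, q = torch.randn(4, 1024), torch.randn(4, 1024)
print([f(v, q).shape for f in (fusion.concat, fusion.hadamard, fusion.bilinear_lowrank)])
```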
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
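As a rough illustration of loss re-scaling under a class-imbalance view, the snippet below weights the cross-entropy loss inversely to answer-class frequency; this is a generic imbalance remedy, and the paper's actual re-scaling scheme may differ.

```python
import torch
import torch.nn as nn

def frequency_weighted_ce(answer_counts):
    """Cross-entropy whose class weights are inversely proportional to answer
    frequency, so rare answers contribute more to the loss. Generic sketch,
    not the paper's exact re-scaling rule."""
    counts = torch.as_tensor(answer_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts.clamp(min=1))
    return nn.CrossEntropyLoss(weight=weights)

# toy usage: four answer classes with heavily skewed frequencies
criterion = frequency_weighted_ce([9000, 700, 200, 100])
loss = criterion(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```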
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
- A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
arXiv Detail & Related papers (2020-04-27T18:09:46Z)
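To ground the idea of a fully-attentive reduction, here is a minimal sketch that scores each element of one modality against a summary vector of the other and pools with the resulting weights; it is a generic cross-attention pooling written in PyTorch, not the specific variant proposed in the paper above, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentivePool(nn.Module):
    """Reduce a set of features from one modality (e.g. image regions) to a
    single vector, with attention scores conditioned on the other modality
    (e.g. a sentence vector). Generic sketch of attentive aggregation."""

    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, elements, context):
        # elements: [B, N, dim], context: [B, dim]
        ctx = context.unsqueeze(1).expand(-1, elements.size(1), -1)
        weights = torch.softmax(self.score(torch.cat([elements, ctx], dim=-1)), dim=1)
        return (weights * elements).sum(dim=1)      # [B, dim]

pool = CrossAttentivePool()
regions, sentence = torch.randn(2, 36, 512), torch.randn(2, 512)
print(pool(regions, sentence).shape)  # torch.Size([2, 512])
```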
This list is automatically generated from the titles and abstracts of the papers on this site.