Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
- URL: http://arxiv.org/abs/2001.07059v1
- Date: Mon, 20 Jan 2020 11:27:21 GMT
- Title: Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
- Authors: Moshiur R. Farazi, Salman H. Khan, Nick Barnes
- Abstract summary: We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
- Score: 39.338304913058685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) has emerged as a Visual Turing Test to
validate the reasoning ability of AI agents. The pivot of existing VQA models
is the joint embedding learned by combining the visual features from an image
with the semantic features from a given question. Consequently, a large body of
literature has focused on developing complex joint embedding strategies, coupled
with visual attention mechanisms, to effectively capture the interplay between
these two modalities. However, modelling the visual and semantic features in a
high-dimensional (joint embedding) space is computationally expensive, and more
complex models often yield only trivial improvements in VQA accuracy. In this
work, we systematically study the trade-off between model complexity and
performance on the VQA task. VQA models have a diverse architecture comprising
pre-processing, feature extraction, multi-modal fusion, attention and final
classification stages. We specifically focus on the effect of "multi-modal
fusion", which is typically the most expensive step in a VQA pipeline. Our
thorough experimental evaluation leads us to two proposals: one optimized for
minimal complexity and the other optimized for state-of-the-art VQA performance.
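The staged pipeline described in the abstract can be made concrete with a minimal sketch. The code below is illustrative only: the module name `SimpleVQAModel`, the dimensions, and the Hadamard-product fusion are assumptions for illustration, not the paper's exact design. It shows where multi-modal fusion sits between the pre-extracted image/question features and the answer classifier.

```python
# Minimal, illustrative VQA model (not the paper's exact architecture).
# Visual features (e.g. from a pretrained detector) and question features
# (e.g. from an RNN over word embeddings) are assumed to be pre-extracted;
# the sketch focuses on where the multi-modal fusion step sits.
import torch
import torch.nn as nn

class SimpleVQAModel(nn.Module):  # hypothetical name
    def __init__(self, v_dim=2048, q_dim=1024, joint_dim=512, num_answers=3000):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, joint_dim)   # project image features
        self.q_proj = nn.Linear(q_dim, joint_dim)   # project question features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(joint_dim, num_answers),
        )

    def forward(self, v_feat, q_feat):
        # Fusion: a cheap Hadamard (element-wise) product; more expressive
        # bilinear fusions trade extra parameters/FLOPs for (often small) gains.
        joint = self.v_proj(v_feat) * self.q_proj(q_feat)
        return self.classifier(joint)  # answer scores

# Usage with random stand-ins for extracted features:
model = SimpleVQAModel()
v = torch.randn(8, 2048)   # batch of image feature vectors
q = torch.randn(8, 1024)   # batch of question feature vectors
logits = model(v, q)       # (8, 3000) answer logits
```

Swapping the element-wise product for concatenation-plus-MLP or a bilinear operator is exactly the kind of change whose accuracy/complexity impact the paper measures.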
Related papers
- Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion [6.9879884952138065]
The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model.
A ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy.
Our model significantly outperforms existing state-of-the-art models on standard VQA datasets.
arXiv Detail & Related papers (2024-08-14T05:18:43Z)
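The ranking idea in the Rank VQA summary above can be illustrated with a standard margin ranking objective. This is a generic sketch, not the authors' implementation; the function name and sampling of a single negative answer are assumptions.

```python
# Illustrative ranking objective: push the score of the correct answer above
# the score of a sampled incorrect answer by a margin. A generic sketch of
# ranking-based training, not the Rank VQA authors' exact module.
import torch
import torch.nn as nn

margin_loss = nn.MarginRankingLoss(margin=0.2)

def ranking_loss(answer_scores, gt_idx, neg_idx):
    """answer_scores: (batch, num_answers) outputs of a fused VQA model.
    gt_idx / neg_idx: indices of the ground-truth and a negative answer."""
    pos = answer_scores.gather(1, gt_idx.unsqueeze(1)).squeeze(1)
    neg = answer_scores.gather(1, neg_idx.unsqueeze(1)).squeeze(1)
    target = torch.ones_like(pos)  # target=1: pos should be ranked above neg
    return margin_loss(pos, neg, target)

scores = torch.randn(4, 3000)              # e.g. from a fused VQA model
gt = torch.randint(0, 3000, (4,))
neg = torch.randint(0, 3000, (4,))
loss = ranking_loss(scores, gt, neg)
```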
- Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach [2.744781070632757]
We compare models that leverage long-range dependencies and simpler models focusing on local textual features within a well-established VQA framework.
We propose ConvGRU, a model that incorporates convolutional layers to improve text feature representation without substantially increasing model complexity.
Tested on the VQA-v2 dataset, ConvGRU demonstrates a modest yet consistent improvement over baselines for question types such as Number and Count.
arXiv Detail & Related papers (2024-05-01T12:39:35Z)
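A convolutional question encoder of the kind the ConvGRU summary describes might look like the sketch below. The kernel size, dimensions and GRU placement are assumptions on my part, not the paper's specification.

```python
# Illustrative question encoder: 1-D convolutions over word embeddings to
# capture local n-gram features, followed by a GRU. The configuration is an
# assumption for illustration, not the ConvGRU paper's exact design.
import torch
import torch.nn as nn

class ConvTextEncoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embed(token_ids)                     # (batch, seq, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # conv along the sequence
        _, h = self.gru(x.transpose(1, 2))            # back to (batch, seq, emb)
        return h.squeeze(0)                           # (batch, hidden) question feature

encoder = ConvTextEncoder()
q_feat = encoder(torch.randint(0, 20000, (8, 14)))   # batch of 14-token questions
```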
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is challenging because the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without requiring additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task uses both image and language analysis to answer a textual question about an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than humans on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering [59.20766562530209]
VQA models still tend to capture superficial linguistic correlations in the training set.
Recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA models.
We propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy.
arXiv Detail & Related papers (2021-10-03T14:31:46Z)
- How to find a good image-text embedding for remote sensing visual question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone.
We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
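The accuracy-versus-complexity question raised by the remote-sensing study above (and by the main paper) can be made tangible by counting fusion parameters. The three operators below are common choices used purely for illustration; they are not necessarily the three methodologies that the remote-sensing paper compares, and the dimensions are assumed.

```python
# Parameter counts for three common multi-modal fusion operators, to make the
# accuracy-vs-complexity trade-off concrete. Generic examples, not necessarily
# the fusion schemes evaluated in the papers above.
import torch.nn as nn

V, Q, J = 2048, 1024, 512   # visual dim, question dim, joint dim (assumed)

fusions = {
    # Concatenate projected features, then a small MLP.
    "concat+mlp": nn.Sequential(nn.Linear(V + Q, J), nn.ReLU(), nn.Linear(J, J)),
    # Element-wise product of projected features (cheapest).
    "hadamard": nn.ModuleList([nn.Linear(V, J), nn.Linear(Q, J)]),
    # Projection layers for a low-rank bilinear (MFB-style) fusion with rank 5;
    # the subsequent element-wise product and sum-pooling add no parameters.
    "low-rank bilinear": nn.ModuleList([nn.Linear(V, 5 * J), nn.Linear(Q, 5 * J)]),
}

for name, module in fusions.items():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name:>18}: {n_params / 1e6:.2f}M parameters")
```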
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
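The counterfactual-sample idea shared by the CSS and CSST entries above can be sketched at the data level: mask the question words (or image objects) that matter most for the current prediction and pair the masked input with a changed supervision signal. The attribution scores and the masking rule below are placeholders, not the papers' exact procedure.

```python
# Illustrative counterfactual question synthesis: mask the most influential
# question words so the model can no longer rely on them. The attribution
# scores are assumed to come from some explanation method (e.g. gradients);
# this is a generic sketch, not the CSS/CSST authors' exact algorithm.
from typing import List

MASK_TOKEN = "[MASK]"

def synthesize_counterfactual(tokens: List[str],
                              attributions: List[float],
                              top_k: int = 2) -> List[str]:
    """Replace the top_k most influential tokens with a mask token."""
    ranked = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    critical = set(ranked[:top_k])
    return [MASK_TOKEN if i in critical else t for i, t in enumerate(tokens)]

question = ["what", "color", "is", "the", "banana"]
scores = [0.05, 0.70, 0.02, 0.03, 0.20]      # hypothetical word attributions
print(synthesize_counterfactual(question, scores))
# ['what', '[MASK]', 'is', 'the', '[MASK]']  -> trained with a changed target
```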
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.