Estimating semantic structure for the VQA answer space
- URL: http://arxiv.org/abs/2006.05726v2
- Date: Thu, 8 Apr 2021 10:33:21 GMT
- Title: Estimating semantic structure for the VQA answer space
- Authors: Corentin Kervadec (imagine), Grigory Antipov, Moez Baccouche,
Christian Wolf (imagine)
- Abstract summary: We show that our approach is completely model-agnostic, as it yields consistent improvements with three different VQA models.
We report SOTA-level performance on the challenging VQAv2-CP dataset.
- Score: 6.49970685896541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since its appearance, Visual Question Answering (VQA, i.e. answering a
question posed over an image) has always been treated as a classification
problem over a set of predefined answers. Despite its convenience, this
classification approach poorly reflects the semantics of the problem: it limits
answering to a choice between independent proposals, without taking into
account the similarity between them (e.g. answering cat or German shepherd
instead of dog is penalized just as heavily as any unrelated answer). We
address this issue by proposing (1) two measures of proximity between VQA
classes, and (2) a corresponding loss which takes the estimated proximity into
account. This significantly improves the generalization of VQA models by
reducing their language bias. In particular, we show that our approach is
completely model-agnostic, as it yields consistent improvements with three
different VQA models. Finally, by combining our method with a language bias
reduction approach, we report SOTA-level performance on the challenging
VQAv2-CP dataset.
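The abstract describes the method only at a high level, so the following is a minimal sketch of the general idea: a hypothetical answer-proximity matrix (here built from cosine similarity of answer embeddings, one plausible proximity measure) softens the one-hot classification target so that semantically close wrong answers are penalized less. The temperature parameter and the soft-target formulation are assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def proximity_matrix(answer_embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between answer embeddings, clamped to
    [0, 1]. answer_embeddings is (num_answers, dim), e.g. averaged word
    vectors per answer string (a hypothetical choice of embedding)."""
    e = F.normalize(answer_embeddings, dim=1)
    return (e @ e.t()).clamp(min=0.0)

def proximity_aware_loss(logits, targets, proximity, temperature=0.1):
    """Cross-entropy against proximity-softened targets: mass from the
    ground-truth answer is smeared over semantically close answers, so
    predicting 'cat' for 'dog' costs less than predicting 'car'.
    logits: (batch, num_answers); targets: (batch,) class indices."""
    soft_targets = F.softmax(proximity[targets] / temperature, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()
```

As the temperature goes to zero, the soft targets collapse back to one-hot vectors and the loss reduces to standard cross-entropy, so this knob controls how much semantic structure is injected.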
Related papers
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
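As a rough illustration of what selective decomposition could look like in code (the vqa_model and decomposer interfaces and the confidence threshold are all hypothetical; the paper's actual selection criterion may differ):

```python
def answer_with_selective_decomposition(image, question, vqa_model,
                                        decomposer, threshold=0.5):
    """Only decompose when the direct prediction looks unreliable."""
    answer, confidence = vqa_model(image, question)
    if confidence >= threshold:
        return answer  # trust the direct prediction
    # Second-guess: answer model-written sub-questions first, then
    # re-ask the original question with those answers as context.
    sub_questions = decomposer(question)
    context = [(sq, vqa_model(image, sq)[0]) for sq in sub_questions]
    answer, _ = vqa_model(image, question, context=context)
    return answer
```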
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, the Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances [17.637150597493463]
We propose a novel training framework that explicitly encourages the VQA model to distinguish between superficially similar instances.
We exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space.
Experimental results show that our method achieves the state-of-the-art performance on VQA-CP v2.
arXiv Detail & Related papers (2022-09-18T10:30:44Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss [73.65872901950135]
This work attempts to tackle the language prior problem from the viewpoint of feature space learning.
An adapted margin cosine loss is designed to discriminate between frequent and sparse answers in the feature space.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models.
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
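The summary above does not give AdaVQA's margin schedule, so the sketch below is a generic CosFace-style margin cosine loss with a per-answer margin vector as a placeholder; how the margins are adapted to answer frequency is an assumption here, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def margin_cosine_loss(features: torch.Tensor,
                       class_weights: torch.Tensor,
                       targets: torch.Tensor,
                       margins: torch.Tensor,
                       scale: float = 16.0) -> torch.Tensor:
    """Large-margin cosine loss with a per-class margin.
    features: (batch, dim) answer features; class_weights: (num_answers, dim);
    margins: (num_answers,), e.g. larger for frequent answers (an assumed
    adaptation scheme)."""
    cos = F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).t()
    # Subtract the margin from the ground-truth class logit only.
    one_hot = F.one_hot(targets, num_classes=cos.size(1)).float()
    logits = scale * (cos - one_hot * margins[targets].unsqueeze(1))
    return F.cross_entropy(logits, targets)
```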
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class-imbalance interpretation on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
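The paper's re-scaling rule is not spelled out in the summary; a minimal stand-in for the class-imbalance view is the "effective number of samples" weighting of Cui et al. (2019), passed as per-class weights to cross-entropy. The beta value and the use of raw answer counts are assumptions.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(answer_counts: torch.Tensor,
                           beta: float = 0.999) -> torch.Tensor:
    """Effective-number weighting: rare answers get larger loss weights,
    frequent answers smaller ones, counteracting the head classes."""
    counts = answer_counts.float().clamp(min=1.0)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(weights)

# Hypothetical usage, with counts taken from the training answers:
# loss = F.cross_entropy(logits, targets,
#                        weight=class_balanced_weights(answer_counts))
```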
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
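ConClaT's contrastive term is not specified in the summary; the sketch below pairs standard cross-entropy with a generic NT-Xent (SimCLR-style) loss over two views of the same (image, question) batch. The two-view setup and the weighting alpha are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss: each row's positive is the same
    example's other view; all remaining rows act as negatives."""
    b = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2B, dim)
    sim = z @ z.t() / tau                                  # (2B, 2B)
    mask = torch.eye(2 * b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))             # drop self-pairs
    pos = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, pos.to(z.device))

def conclat_style_loss(logits, targets, z1, z2, alpha=0.5):
    """Joint objective: answer classification plus contrastive
    consistency between two views of the same input (assumed setup)."""
    return F.cross_entropy(logits, targets) + alpha * nt_xent(z1, z2)
```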
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder [12.56413718364189]
We propose a novel model-agnostic question encoder, Visually-Grounded Question (VGQE) for VQA.
VGQE utilizes both visual and language modalities equally while encoding the question.
We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results.
arXiv Detail & Related papers (2020-07-13T05:36:36Z)
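VGQE's exact fusion mechanism is not described in the summary above; the module below is a rough sketch of the general idea, fusing a projected visual feature with each word embedding before the recurrent update so the question encoding is visually grounded from the first word. The concatenation fusion, the single pooled visual vector, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VisuallyGroundedQuestionEncoder(nn.Module):
    """Sketch of a visually grounded question encoder: each word
    embedding is concatenated with a projected image feature before
    entering the GRU, so both modalities shape the question encoding."""
    def __init__(self, vocab_size: int, embed_dim: int = 300,
                 visual_dim: int = 2048, hidden_dim: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.gru = nn.GRU(2 * embed_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens, visual_features):
        # question_tokens: (B, T); visual_features: (B, visual_dim)
        w = self.embed(question_tokens)               # (B, T, E)
        v = self.visual_proj(visual_features)         # (B, E)
        v = v.unsqueeze(1).expand(-1, w.size(1), -1)  # (B, T, E)
        _, h = self.gru(torch.cat([w, v], dim=2))     # h: (1, B, H)
        return h.squeeze(0)                           # (B, H)
```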