Barlow constrained optimization for Visual Question Answering
- URL: http://arxiv.org/abs/2203.03727v1
- Date: Mon, 7 Mar 2022 21:27:40 GMT
- Title: Barlow constrained optimization for Visual Question Answering
- Authors: Abhishek Jha, Badri N. Patro, Luc Van Gool, Tinne Tuytelaars
- Abstract summary: We propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB).
Our model also aligns the joint space with the answer embedding space, where we consider the answer and image+question as two different `views' of what in essence is the same semantic information.
When built on the state-of-the-art GGE model, the resulting model improves VQA accuracy by 1.4% and 4% on the VQA-CP v2 and VQA v2 datasets respectively.
- Score: 105.3372546782068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering is a vision-and-language multimodal task, that aims
at predicting answers given samples from the question and image modalities.
Most recent methods focus on learning a good joint embedding space of images
and questions, either by improving the interaction between these two
modalities, or by making it a more discriminative space. However, how informative
this joint space is has not been well explored. In this paper, we propose a
novel regularization for VQA models, Constrained Optimization using Barlow's
theory (COB), that improves the information content of the joint space by
minimizing the redundancy. It reduces the correlation between the learned
feature components and thereby disentangles semantic concepts. Our model also
aligns the joint space with the answer embedding space, where we consider the
answer and image+question as two different `views' of what in essence is the
same semantic information. We propose a constrained optimization policy to
balance the categorical and redundancy minimization forces. When built on the
state-of-the-art GGE model, the resulting model improves VQA accuracy by 1.4%
and 4% on the VQA-CP v2 and VQA v2 datasets respectively. The model also
exhibits better interpretability.
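To make the redundancy-minimization idea concrete, below is a minimal PyTorch sketch of a Barlow Twins-style cross-correlation regularizer between the joint (image+question) embedding and the answer embedding, balanced against a categorical loss. This is an illustration under assumptions, not the paper's actual COB constrained-optimization policy or its integration with GGE: the function names, the fixed weights `off_diag_weight` and `alpha`, and the use of plain cross-entropy are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def barlow_redundancy_loss(z_joint, z_answer, off_diag_weight=5e-3):
    """Barlow Twins-style loss between the joint (image+question) embedding
    and the answer embedding, treated as two `views' of the same semantics.
    z_joint, z_answer: (batch, dim) tensors (hypothetical shapes)."""
    n, d = z_joint.shape
    # Standardize each feature dimension over the batch.
    z1 = (z_joint - z_joint.mean(0)) / (z_joint.std(0) + 1e-6)
    z2 = (z_answer - z_answer.mean(0)) / (z_answer.std(0) + 1e-6)
    # Cross-correlation matrix between the two views: (dim, dim).
    c = (z1.T @ z2) / n
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # alignment term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + off_diag_weight * off_diag

def total_loss(logits, targets, z_joint, z_answer, alpha=0.1):
    # Categorical (answer-classification) force plus the redundancy-minimization
    # force; a fixed alpha stands in for the paper's constrained-optimization
    # balancing, which is not detailed in the abstract.
    ce = F.cross_entropy(logits, targets)
    return ce + alpha * barlow_redundancy_loss(z_joint, z_answer)
```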
Related papers
- LASERS: LAtent Space Encoding for Representations with Sparsity for Generative Modeling [3.9426000822656224]
We show that our latent space is more expressive and leads to better representations than the Vector Quantization approach.
Our results thus suggest that the true benefit of the VQ approach might not be from discretization of the latent space, but rather the lossy compression of the latent space.
arXiv Detail & Related papers (2024-09-16T08:20:58Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce a symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z) - Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer.
arXiv Detail & Related papers (2020-07-23T17:20:55Z) - Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.