Cross-Modal Contrastive Learning for Robust Reasoning in VQA
- URL: http://arxiv.org/abs/2211.11190v1
- Date: Mon, 21 Nov 2022 05:32:24 GMT
- Title: Cross-Modal Contrastive Learning for Robust Reasoning in VQA
- Authors: Qi Zheng, Chaoyue Wang, Daqing Liu, Dadong Wang, Dacheng Tao
- Abstract summary: Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently.
Most reasoning models heavily rely on shortcuts learned from training data.
We propose a simple but effective cross-modal contrastive learning strategy to eliminate shortcut reasoning.
- Score: 76.1596796687494
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal reasoning in visual question answering (VQA) has witnessed rapid
progress recently. However, most reasoning models heavily rely on shortcuts
learned from training data, which prevents their usage in challenging
real-world scenarios. In this paper, we propose a simple but effective
cross-modal contrastive learning strategy to eliminate the shortcut reasoning
caused by imbalanced annotations and improve overall performance. Unlike
existing contrastive learning methods that construct complex negative categories
at the coarse (Image, Question, Answer) triplet level, we leverage the
correspondences between the language and image modalities to perform
finer-grained cross-modal contrastive learning. We treat each Question-Answer
(QA) pair as a whole, and differentiate between images that conform with it and
those against it. To alleviate the issue of sampling bias, we further build
connected graphs among images. For each positive pair, we regard the images from
different graphs as negative samples and derive a multi-positive variant of
contrastive learning. To the best of our knowledge, this is the first paper to
show that a general contrastive learning strategy, without delicate
hand-crafted rules, can contribute to robust VQA reasoning. Experiments on
several mainstream VQA datasets demonstrate the superiority of our method over
the state of the art. Code is available at
https://github.com/qizhust/cmcl_vqa_pl.
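The objective described in the abstract can be pictured with a short sketch. The PyTorch-style snippet below is a minimal, illustrative implementation of a multi-positive contrastive loss in which each QA pair acts as the anchor, images grouped into the same connected graph (those that conform with the QA pair) are positives, and images from other graphs are negatives. It is not the authors' released code (see the repository above); the function name, tensor shapes, positive-mask construction, and temperature value are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def multi_positive_cmcl_loss(qa_emb, img_emb, pos_mask, temperature=0.07):
    """Multi-positive contrastive loss with QA pairs as anchors (sketch).

    qa_emb:   (B, D) embeddings of QA pairs (anchors).
    img_emb:  (N, D) embeddings of candidate images.
    pos_mask: (B, N) bool, True where image j conforms with QA pair i
              (i.e. lies in the same connected graph); all other images
              are treated as negatives.
    """
    qa_emb = F.normalize(qa_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)

    # Cosine similarities between every QA anchor and every image.
    logits = qa_emb @ img_emb.t() / temperature            # (B, N)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Multi-positive form of InfoNCE: average the log-probability over
    # all positives of each anchor; anchors with no positive are skipped.
    pos = pos_mask.float()
    pos_counts = pos.sum(dim=1).clamp(min=1)
    mean_log_prob_pos = (log_prob * pos).sum(dim=1) / pos_counts
    valid = pos_mask.any(dim=1)
    return -(mean_log_prob_pos[valid]).mean()

# Toy usage with random embeddings and a hand-built positive mask.
if __name__ == "__main__":
    qa = torch.randn(4, 256)
    imgs = torch.randn(10, 256)
    mask = torch.zeros(4, 10, dtype=torch.bool)
    mask[0, :3] = True    # images 0-2 conform with QA pair 0, etc.
    mask[1, 3:5] = True
    mask[2, 5:8] = True
    mask[3, 8:] = True
    print(multi_positive_cmcl_loss(qa, imgs, mask))
```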
Related papers
- Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664] (2023-11-11)
  We propose a heterogeneous graph contrastive learning method to better accomplish the visual reasoning task. Our method is designed in a plug-and-play manner, so that it can be quickly and easily combined with a wide range of representative methods.
- Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation [12.754320302262533] (2022-04-23)
  We introduce a new negative pruning technique for unpaired image-to-image translation (PUT) by sparsifying and ranking the patches. The proposed algorithm is efficient, flexible, and enables the model to stably learn essential information between corresponding patches.
- Adversarial Graph Contrastive Learning with Information Regularization [51.14695794459399] (2022-02-14)
  Contrastive learning is an effective method in graph representation learning, but data augmentation on graphs is far less intuitive and it is much harder to provide high-quality contrastive samples. We propose a simple but effective method, Adversarial Graph Contrastive Learning (ARIEL), which consistently outperforms current graph contrastive learning methods on the node classification task over various real-world datasets.
- Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency [13.19476138523546] (2021-05-13)
  Cross-video relations have barely been explored for visual representation learning. We propose a novel contrastive learning method that exploits cross-video relations via cycle-consistency for general image representation learning, and show significant improvement over state-of-the-art contrastive learning methods.
- Warp Consistency for Unsupervised Learning of Dense Correspondences [116.56251250853488] (2021-04-07)
  A key challenge in learning dense correspondences is the lack of ground-truth matches for real image pairs. We propose Warp Consistency, an unsupervised learning objective for dense correspondence regression. Our approach sets a new state of the art on several challenging benchmarks, including MegaDepth, RobotCar and TSS.
- Delving into Inter-Image Invariance for Unsupervised Visual Representations [108.33534231219464] (2020-08-26)
  We present a study to better understand the role of inter-image invariance learning. Online labels converge faster than offline labels, and semi-hard negative samples are more reliable and less biased than hard negative samples.
- Learning to Compare Relation: Semantic Alignment for Few-Shot Learning [48.463122399494175] (2020-02-29)
  We present a novel semantic alignment model to compare relations, which is robust to content misalignment. We conduct extensive experiments on several few-shot learning datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.