Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem
- URL: http://arxiv.org/abs/2207.11850v1
- Date: Sun, 24 Jul 2022 23:50:52 GMT
- Title: Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem
- Authors: Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, Yan Yan
- Abstract summary: We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
Experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness.
- Score: 60.0878532426877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several studies have recently pointed out that existing Visual Question Answering (VQA) models heavily suffer from the language prior problem, i.e., they capture superficial statistical correlations between the question type and the answer while ignoring the image content. Numerous efforts have been dedicated to strengthening image dependency by designing delicate models or introducing extra visual annotations. However, these methods cannot sufficiently explore how visual cues explicitly affect the learned answer representation, which is vital for alleviating language reliance. Moreover, they generally emphasize class-level discrimination of the learned answer representation and overlook the more fine-grained instance-level patterns, which demand further optimization. In this paper, we propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration, which better investigates fine-grained visual effects and mitigates the language prior problem by learning instance-level characteristics. Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents, based on which the collaborative learning of intra-instance invariance and inter-instance discrimination is implemented by two well-designed discriminators. In addition, we apply an information bottleneck modulator on the latent space for further bias alleviation and representation calibration. We plug our visual perturbation-aware framework into three orthodox baselines, and the experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness. We further verify its robustness on the balanced VQA benchmark.
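To make the training signal concrete, the following is a minimal PyTorch-style sketch of how such a collaborative objective could be wired up, assuming the question has already been fused with the clean, lightly perturbed, and heavily perturbed visual inputs into answer representations z_clean, z_light, and z_heavy. All names, the InfoNCE-style discrimination term, and the Gaussian information-bottleneck term are illustrative assumptions; the paper's actual controller, discriminators, and modulator are not reproduced here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PerturbationAwareCollaborativeLoss(nn.Module):
        """Illustrative sketch only; not the authors' implementation."""

        def __init__(self, dim, temperature=0.1, beta=1e-3):
            super().__init__()
            self.temperature = temperature
            self.beta = beta
            # Toy stand-in for an information-bottleneck modulator on the latent space.
            self.ib_mu = nn.Linear(dim, dim)
            self.ib_logvar = nn.Linear(dim, dim)

        def forward(self, z_clean, z_light, z_heavy):
            # z_*: [B, D] answer representations from the clean image, a lightly
            # perturbed image, and a heavily perturbed image (hypothetical inputs).
            mu, logvar = self.ib_mu(z_clean), self.ib_logvar(z_clean)
            z_ib = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

            z_ib = F.normalize(z_ib, dim=-1)
            z_light = F.normalize(z_light, dim=-1)
            z_heavy = F.normalize(z_heavy, dim=-1)

            # Intra-instance invariance: a mild perturbation should not move the
            # instance's representation.
            invariance = (1.0 - (z_ib * z_light).sum(-1)).mean()

            # Inter-instance discrimination: InfoNCE-style, with the lightly
            # perturbed view as positive and heavily perturbed views as negatives.
            pos = (z_ib * z_light).sum(-1, keepdim=True)    # [B, 1]
            neg = z_ib @ z_heavy.t()                        # [B, B]
            logits = torch.cat([pos, neg], dim=1) / self.temperature
            labels = torch.zeros(z_ib.size(0), dtype=torch.long, device=z_ib.device)
            discrimination = F.cross_entropy(logits, labels)

            return invariance + discrimination + self.beta * kl

For example, PerturbationAwareCollaborativeLoss(dim=512) applied to three [8, 512] tensors returns a scalar that could be added to a standard VQA answer loss.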
Related papers
- Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664]
We propose a heterogeneous graph contrastive learning method to better accomplish the visual reasoning task.
Our method is designed in a plug-and-play manner, so that it can be quickly and easily combined with a wide range of representative methods.
arXiv Detail & Related papers (2023-11-11T12:01:18Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
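As a rough illustration of the "mutual relation attention" idea summarized above, here is a bidirectional cross-attention block between instance-object features and background features. The module name, dimensions, and residual wiring are assumptions for the sketch, not LOIS's actual architecture.

    import torch
    import torch.nn as nn

    class MutualRelationAttention(nn.Module):
        """Hypothetical sketch: objects attend to background and vice versa."""

        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.obj_to_bg = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.bg_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm_obj = nn.LayerNorm(dim)
            self.norm_bg = nn.LayerNorm(dim)

        def forward(self, obj_feats, bg_feats):
            # obj_feats: [B, N_obj, D] instance-level semantics
            # bg_feats:  [B, N_bg, D]  background / context features
            obj_ctx, _ = self.obj_to_bg(obj_feats, bg_feats, bg_feats)
            bg_ctx, _ = self.bg_to_obj(bg_feats, obj_feats, obj_feats)
            return self.norm_obj(obj_feats + obj_ctx), self.norm_bg(bg_feats + bg_ctx)

    # Toy usage with random tensors standing in for extracted features.
    module = MutualRelationAttention()
    obj, bg = module(torch.randn(2, 36, 512), torch.randn(2, 49, 512))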
- Focalized Contrastive View-invariant Learning for Self-supervised Skeleton-based Action Recognition [16.412306012741354]
We propose a self-supervised framework called Focalized Contrastive View-invariant Learning (FoCoViL).
FoCoViL significantly suppresses the view-specific information on the representation space where the viewpoints are coarsely aligned.
It associates actions with common view-invariant properties and simultaneously separates the dissimilar ones.
arXiv Detail & Related papers (2023-04-03T10:12:30Z)
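A toy reading of the view-invariant contrastive objective described above: pull together embeddings of the same skeleton sequence seen from two viewpoints, and push apart different sequences that look too similar. The function, margin, and hinge form are assumptions; FoCoViL's focalized weighting is not reproduced here.

    import torch
    import torch.nn.functional as F

    def view_invariant_loss(z_view_a, z_view_b, margin=0.5):
        """Toy sketch, not FoCoViL's exact objective.
        z_view_a, z_view_b: [B, D] embeddings of the same B sequences from two views."""
        za = F.normalize(z_view_a, dim=-1)
        zb = F.normalize(z_view_b, dim=-1)
        sim = za @ zb.t()                                   # pairwise cosine similarities
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        align = (1.0 - sim[eye]).mean()                     # same sequence, other view
        separate = F.relu(sim[~eye] - margin).mean()        # penalize too-similar negatives
        return align + separate

    loss = view_invariant_loss(torch.randn(16, 128), torch.randn(16, 128))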
- Image Difference Captioning with Pre-training and Contrastive Learning [45.59621065755761]
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.
The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require learning stronger vision and language association and 2) high-cost of manual annotations.
We propose a new modeling framework following the pre-training-finetuning paradigm to address these challenges.
arXiv Detail & Related papers (2022-02-09T06:14:22Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object in an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
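The summary above describes swapping region regression and classification for region-level contrastive learning; under that reading, a minimal InfoNCE-style sketch could look as follows. The pairing of cross-modally predicted region embeddings with the original region features, and the names, are assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def region_contrastive_loss(pred_regions, target_regions, temperature=0.07):
        """Each predicted region embedding should match its own original region
        feature and not the other regions. pred_regions, target_regions: [N, D]."""
        p = F.normalize(pred_regions, dim=-1)
        t = F.normalize(target_regions, dim=-1)
        logits = p @ t.t() / temperature                    # [N, N] similarities
        labels = torch.arange(p.size(0), device=p.device)   # i-th prediction matches i-th region
        return F.cross_entropy(logits, labels)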
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
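As a minimal sketch of the class-imbalance reading above, one could re-weight the answer cross-entropy inversely to answer frequency so that rare answers contribute more to the loss. The function below is an assumption-level illustration; the paper's actual re-scaling scheme is question-type aware and more involved.

    import torch
    import torch.nn.functional as F

    def rescaled_vqa_loss(logits, answers, answer_counts, alpha=1.0):
        """logits: [B, C] answer scores, answers: [B] ground-truth indices,
        answer_counts: [C] training-set answer frequencies (hypothetical inputs)."""
        weights = answer_counts.float().clamp(min=1) ** (-alpha)   # rarer answer -> larger weight
        weights = weights * (weights.numel() / weights.sum())      # normalize to mean 1
        return F.cross_entropy(logits, answers, weight=weights)

    loss = rescaled_vqa_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                             torch.randint(1, 1000, (10,)))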
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
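One way to read "gradient supervision" over counterfactual pairs is to require the input-space gradient to point from an example toward its minimally different counterpart. The sketch below implements that reading on plain feature vectors; the function name, score choice, and cosine alignment are assumptions, not the authors' exact objective.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gradient_supervision_loss(model, x, x_cf, y, y_cf):
        """x, x_cf: [B, D] counterfactual pair of inputs; y, y_cf: [B] their labels."""
        x = x.detach().clone().requires_grad_(True)
        logits = model(x)                                           # [B, C]
        # Score that should grow as x moves toward its counterfactual's class.
        score = (logits.gather(1, y_cf.unsqueeze(1)) - logits.gather(1, y.unsqueeze(1))).sum()
        grad = torch.autograd.grad(score, x, create_graph=True)[0]  # [B, D]
        direction = (x_cf - x).detach()
        cos = F.cosine_similarity(grad, direction, dim=1)
        return (1.0 - cos).mean()

    # Toy usage with a linear classifier over feature vectors.
    model = nn.Linear(16, 3)
    x, x_cf = torch.randn(4, 16), torch.randn(4, 16)
    y, y_cf = torch.randint(0, 3, (4,)), torch.randint(0, 3, (4,))
    gradient_supervision_loss(model, x, x_cf, y, y_cf).backward()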