LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering
- URL: http://arxiv.org/abs/2105.14300v1
- Date: Sat, 29 May 2021 13:48:11 GMT
- Title: LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering
- Authors: Zujie Liang, Haifeng Hu and Jiaying Zhu
- Abstract summary: We propose a novel Language-Prior Feedback (LPF) objective function to re-balance the proportion of each answer's loss value in the total Visual Question Answering (VQA) loss.
We conduct extensive experiments and the results show that the LPF brings a significant improvement over various VQA models.
- Score: 11.845589863914853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing Visual Question Answering (VQA) systems tend to rely
overly on language bias and hence fail to reason from the visual clues. To
address this issue, we propose a novel Language-Prior Feedback (LPF) objective
function that re-balances the proportion of each answer's loss value in the
total VQA loss. The LPF first calculates a modulating factor that measures the
language bias using a question-only branch. It then assigns a self-adaptive
weight to each training sample. With this reweighting mechanism, the LPF
reshapes the total VQA loss into a more balanced form, so that the samples
which require visual information to answer are used more effectively during
training. Our method is simple to implement, model-agnostic, and end-to-end
trainable. We conduct extensive experiments, and the results show that the LPF
(1) brings a significant improvement over various VQA models and (2) achieves
competitive performance on the bias-sensitive VQA-CP v2 benchmark.
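As a rough illustration of the reweighting described above, the following
PyTorch-style sketch weights each sample's cross-entropy loss by how confidently
the question-only branch predicts the ground-truth answer. The focal-style
(1 - p)^gamma form of the modulating factor, the hard answer labels, and the
names lpf_loss and gamma are illustrative assumptions, not the paper's exact
formulation.

```python
import torch
import torch.nn.functional as F

def lpf_loss(vqa_logits, qonly_logits, targets, gamma=2.0):
    """Minimal sketch of language-prior-feedback reweighting.

    Assumptions (not from the paper): hard answer labels and a
    focal-style (1 - p) ** gamma modulating factor.

    vqa_logits   : [B, A] logits from the full VQA model
    qonly_logits : [B, A] logits from the question-only branch
    targets      : [B]    ground-truth answer indices
    """
    # Probability the question-only branch assigns to the true answer.
    # A high value means the answer is predictable from language alone,
    # i.e. the sample is strongly affected by language bias.
    p_bias = F.softmax(qonly_logits.detach(), dim=-1)
    p_bias = p_bias.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Self-adaptive weight: heavily biased samples are down-weighted,
    # while samples that need visual evidence keep weights near 1.
    weights = (1.0 - p_bias) ** gamma

    # Reweighted cross-entropy, re-balancing the total VQA loss.
    per_sample = F.cross_entropy(vqa_logits, targets, reduction="none")
    return (weights * per_sample).mean()
```

Detaching the question-only logits keeps the branch from being optimized
through the weights, so it only supplies the bias estimate.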
Related papers
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training [22.473676537463607]
Visual Question Answering (VQA) models commonly face the challenge of language bias.
We present a novel framework to reduce the language bias of VQA for remote sensing data.
arXiv Detail & Related papers (2023-06-01T09:32:45Z)
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to better focus on relevant image features.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering [25.540831728925557]
This paper investigates whether a vision-language pre-trained model can be compressed and debiased simultaneously by searching for sparse and robust subnetworks.
Our results show that such sparse and robust subnetworks do exist and are competitive with the debiased full model.
arXiv Detail & Related papers (2022-10-26T08:25:03Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
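To illustrate the counterfactual synthesis described in the CSS entry above,
here is a minimal sketch that masks the most critical question words so the
original answer should no longer follow from the question alone. The
MASK_TOKEN id and the per-token word_scores (assumed to come from gradients or
attention) are illustrative; the full CSS scheme also masks critical image
objects and assigns new ground-truth answers, which this sketch omits.

```python
import torch

MASK_TOKEN = 0  # assumed id of a [MASK]-style placeholder token

def counterfactual_question(token_ids, word_scores, k=1):
    """Minimal sketch of CSS-style question masking.

    token_ids   : [T] question token ids
    word_scores : [T] per-token importance scores (assumed to come
                  from gradients or attention; higher = more critical)
    k           : number of critical words to mask

    Returns a counterfactual question whose k most critical words are
    replaced by MASK_TOKEN.
    """
    cf = token_ids.clone()
    critical = torch.topk(word_scores, k).indices
    cf[critical] = MASK_TOKEN
    return cf
```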