QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning
- URL: http://arxiv.org/abs/2504.03337v1
- Date: Fri, 04 Apr 2025 10:38:28 GMT
- Title: QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning
- Authors: Quanxing Xu, Ling Zhou, Xian Zhong, Feifei Zhang, Rubing Huang, Chia-Wen Lin,
- Abstract summary: Current debiasing techniques fail to capture the superior relation between images and texts.<n>No prior work has examined the degree of input relevance in debiasing studies.<n>We propose a novel framework, which employs a generation-based self-supervised learning strategy.
- Score: 38.24779287568123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing debiasing approaches in Visual Question Answering (VQA) primarily focus on enhancing visual learning, integrating auxiliary models, or employing data augmentation strategies. However, these methods exhibit two major drawbacks. First, current debiasing techniques fail to capture the superior relation between images and texts because prevalent learning frameworks do not enable models to extract deeper correlations from highly contrasting samples. Second, they do not assess the relevance between the input question and image during inference, as no prior work has examined the degree of input relevance in debiasing studies. Motivated by these limitations, we propose a novel framework, Optimized Question-Image Relation Learning (QIRL), which employs a generation-based self-supervised learning strategy. Specifically, two modules are introduced to address the aforementioned issues. The Negative Image Generation (NIG) module automatically produces highly irrelevant question-image pairs during training to enhance correlation learning, while the Irrelevant Sample Identification (ISI) module improves model robustness by detecting and filtering irrelevant inputs, thereby reducing prediction errors. Furthermore, to validate our concept of reducing output errors through filtering unrelated question-image inputs, we propose a specialized metric to evaluate the performance of the ISI module. Notably, our approach is model-agnostic and can be integrated with various VQA models. Extensive experiments on VQA-CPv2 and VQA-v2 demonstrate the effectiveness and generalization ability of our method. Among data augmentation strategies, our approach achieves state-of-the-art results.
Related papers
- FakeReasoning: Towards Generalizable Forgery Detection and Reasoning [24.8865218866598]
We propose modeling AI-generated image detection and explanation as a Forgery Detection and Reasoning task (FDR-Task)<n>We introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 100K images across 10 generative models.<n>We also propose FakeReasoning, a forgery detection and reasoning framework with two key components.
arXiv Detail & Related papers (2025-03-27T06:54:06Z) - Admitting Ignorance Helps the Video Question Answering Models to Answer [82.22149677979189]
We argue that models often establish shortcuts, resulting in spurious correlations between questions and answers.
We propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question.
In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness.
arXiv Detail & Related papers (2025-01-15T12:44:52Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z) - Visual Perturbation-aware Collaborative Learning for Overcoming the
Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z) - Robustness Evaluation of Stacked Generative Adversarial Networks using
Metamorphic Testing [0.39146761527401414]
StackGAN-v2 has proven capable of generating high resolution images that reflect the details specified in the input text descriptions.
We adopt Metamorphic Testing technique to evaluate the robustness of the model with a variety of unexpected training dataset.
We find that StackGAN-v2 algorithm is susceptible to input images with obtrusive objects, even if it overlaps with the main object minimally.
arXiv Detail & Related papers (2021-03-04T07:29:17Z) - One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.