Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
- URL: http://arxiv.org/abs/2306.03932v1
- Date: Tue, 6 Jun 2023 18:00:47 GMT
- Title: Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
- Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker
- Abstract summary: SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
- Score: 103.09776737512077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finetuning a large vision language model (VLM) on a target dataset after
large scale pretraining is a dominant paradigm in visual question answering
(VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in
non-natural-image domains are orders of magnitude smaller than those for
general-purpose VQA. While collecting additional labels for specialized tasks
or domains can be challenging, unlabeled images are often available. We
introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning
large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset
to build a teacher model that can generate question-answer pseudolabels
directly conditioned on an image alone, allowing us to pseudolabel unlabeled
images. SelTDA then finetunes the initial VLM on the original dataset augmented
with freshly pseudolabeled images. We describe a series of experiments showing
that our self-taught data augmentation increases robustness to adversarially
searched questions, counterfactual examples and rephrasings, improves domain
generalization, and results in greater retention of numerical reasoning skills.
The proposed strategy requires no additional annotations or architectural
modifications, and is compatible with any modern encoder-decoder multimodal
transformer. Code available at https://github.com/codezakh/SelTDA.
Related papers
- ATTIQA: Generalizable Image Quality Feature Extractor using Attribute-aware Pretraining [25.680035174334886]
In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models.
We propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge.
Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities.
arXiv Detail & Related papers (2024-06-03T06:03:57Z)
- Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM Selection setting.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data.
Our proposed self-adaptation strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Generative Visual Question Answering [0.0]
This paper discusses a viable approach to building an advanced Visual Question Answering (VQA) model that can produce successful results on temporal generalization.
We propose a new dataset, GenVQA, utilizing images and captions from the VQAv2 and MS-COCO datasets to generate new images through stable diffusion.
Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with the answers having been adjusted to the new images.
arXiv Detail & Related papers (2023-07-18T05:30:23Z)
- Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering [18.33311267792116]
We find that many of the "unknowns" to the learned VQA model are indeed "known" in the dataset implicitly.
We present a simple data augmentation pipeline SimpleAug to turn this "known" knowledge into training examples for VQA.
arXiv Detail & Related papers (2021-09-13T16:56:43Z)
- Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering in the multi-image setting.
Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image.
In this report, we work with four approaches in a bid to improve performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z)
- Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering [65.54116210742511]
Visual Question Answering (VQA) has achieved great success thanks to the rapid development of deep neural networks (DNNs).
In this paper, instead of directly manipulating images and questions, we use generated adversarial examples for both images and questions as the augmented data.
We find that we not only improve overall performance on VQAv2, but can also withstand adversarial attacks effectively.
arXiv Detail & Related papers (2020-07-19T05:01:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.