Compressing And Debiasing Vision-Language Pre-Trained Models for Visual
Question Answering
- URL: http://arxiv.org/abs/2210.14558v2
- Date: Wed, 11 Oct 2023 18:28:27 GMT
- Title: Compressing And Debiasing Vision-Language Pre-Trained Models for Visual
Question Answering
- Authors: Qingyi Si, Yuanxin Liu, Zheng Lin, Peng Fu and Weiping Wang
- Abstract summary: This paper investigates whether a vision-language pre-trained model can be compressed and debiased simultaneously by searching sparse and robust subnetworks.
Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full VLP.
- Score: 25.540831728925557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the excellent performance of vision-language pre-trained models
(VLPs) on conventional VQA task, they still suffer from two problems: First,
VLPs tend to rely on language biases in datasets and fail to generalize to
out-of-distribution (OOD) data. Second, they are inefficient in terms of memory
footprint and computation. Although promising progress has been made in both
problems, most existing works tackle them independently. To facilitate the
application of VLP to VQA tasks, it is imperative to jointly study VLP
compression and OOD robustness, which, however, has not yet been explored. This
paper investigates whether a VLP can be compressed and debiased simultaneously
by searching sparse and robust subnetworks. To this end, we systematically
study the design of a training and compression pipeline to search the
subnetworks, as well as the assignment of sparsity to different
modality-specific modules. Our experiments involve 3 VLPs, 2 compression
methods, 4 training methods, 2 datasets and a range of sparsity levels and
random seeds. Our results show that there indeed exist sparse and robust
subnetworks, which are competitive with the debiased full VLP and clearly
outperform the debiasing SoTAs with fewer parameters on OOD datasets VQA-CP v2
and VQA-VS. The codes can be found at
https://github.com/PhoebusSi/Compress-Robust-VQA.
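As a rough illustration of the subnetwork-search idea described in the abstract, the sketch below applies magnitude pruning with a separate sparsity level for each modality-specific module and keeps the resulting binary masks fixed for subsequent (debiased) fine-tuning. This is a minimal sketch under stated assumptions, not the authors' pipeline: the toy model, the module names (`text_encoder`, `visual_encoder`, `fusion`) and the per-module sparsity values are all made up for illustration.

```python
import torch
import torch.nn as nn

def magnitude_mask(module: nn.Module, sparsity: float) -> dict:
    """Return {param_name: 0/1 mask} keeping the largest-magnitude weights."""
    masks = {}
    for name, param in module.named_parameters():
        if param.dim() < 2:                       # skip biases / LayerNorm params
            continue
        k = int(param.numel() * sparsity)         # number of weights to prune
        if k == 0:
            masks[name] = torch.ones_like(param)
            continue
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()
    return masks

def apply_masks(module: nn.Module, masks: dict) -> None:
    """Zero out pruned weights in place (re-applied after every optimizer step)."""
    with torch.no_grad():
        for name, param in module.named_parameters():
            if name in masks:
                param.mul_(masks[name])

# Hypothetical VLP with modality-specific modules; real VLPs expose
# different (and deeper) submodules, this stand-in only fixes the names.
class ToyVLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(768, 768)
        self.visual_encoder = nn.Linear(2048, 768)
        self.fusion = nn.Linear(768, 768)

model = ToyVLP()

# Assign a different (assumed) sparsity to each modality-specific module,
# then fine-tune with a debiasing objective while keeping the masks fixed.
sparsity_per_module = {"text_encoder": 0.7, "visual_encoder": 0.5, "fusion": 0.3}
all_masks = {
    name: magnitude_mask(getattr(model, name), s)
    for name, s in sparsity_per_module.items()
}
for name, masks in all_masks.items():
    apply_masks(getattr(model, name), masks)
```

The actual study compares two compression methods and four training methods; the snippet only shows the mechanical part of assigning different sparsity levels to different modality-specific modules and enforcing the resulting masks.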
Related papers
- Task Progressive Curriculum Learning for Robust Visual Question Answering [6.2175732887853545]
We show for the first time that robust Visual Question Answering is attainable by simply enhancing the training strategy.
Our proposed approach, Task Progressive Curriculum Learning (TPCL), breaks the main VQA problem into smaller, easier tasks.
We demonstrate TPCL effectiveness through a comprehensive evaluation on standard datasets.
arXiv Detail & Related papers (2024-11-26T10:29:47Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP)
arXiv Detail & Related papers (2023-06-01T07:19:28Z) - Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in
Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal (a minimal sketch appears after this list).
arXiv Detail & Related papers (2023-05-29T11:03:59Z) - Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object.
PTP achieves comparable results with object-detector based methods and much faster inference, since PTP discards its object detector at inference time while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z) - A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting the performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z) - VL-CheckList: Evaluating Pre-trained Vision-Language Models with
Objects, Attributes and Relations [28.322824790738768]
Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by the CheckList for testing natural language processing, we exploit VL-CheckList, a novel framework.
arXiv Detail & Related papers (2022-07-01T06:25:53Z) - LPF: A Language-Prior Feedback Objective Function for De-biased Visual
Question Answering [11.845589863914853]
We propose a novel Language-Prior Feedback (LPF) objective function to re-balance the proportion of each answer's loss value in the total Visual Question Answering (VQA) loss.
We conduct extensive experiments and the results show that the LPF brings a significant improvement over various VQA models.
arXiv Detail & Related papers (2021-05-29T13:48:11Z)
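To make the LPF entry above concrete, here is a minimal sketch of a language-prior-weighted VQA loss: a question-only (language-prior) model's confidence on the ground-truth answer down-weights that example's contribution, so answers the language prior already predicts count less toward the total loss. The focal-style exponent `gamma` and the exact weighting form are assumptions for illustration, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def lpf_style_loss(vqa_logits, question_only_logits, targets, gamma=2.0):
    """Down-weight examples whose answer is already predicted by a
    question-only (language-prior) model. The (1 - p_prior)**gamma weight
    is an assumed focal-style form, not necessarily the paper's exact one."""
    ce = F.cross_entropy(vqa_logits, targets, reduction="none")   # per-example loss
    with torch.no_grad():
        p_prior = F.softmax(question_only_logits, dim=-1)
        p_gt = p_prior.gather(1, targets.unsqueeze(1)).squeeze(1) # prior prob of gold answer
    weights = (1.0 - p_gt) ** gamma
    return (weights * ce).mean()

# Usage with dummy tensors (3000 candidate answers is a common VQA setting).
vqa_logits = torch.randn(8, 3000, requires_grad=True)
qo_logits = torch.randn(8, 3000)
targets = torch.randint(0, 3000, (8,))
loss = lpf_style_loss(vqa_logits, qo_logits, targets)
loss.backward()
```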
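For the test-time adaptation with CLIP reward entry above (RLCF), the following is a hedged sketch of the general mechanism: candidate answers are sampled from the VLM's output distribution, scored by CLIP image-text similarity, and a REINFORCE-style update raises the log-probability of high-reward samples. The checkpoints, the single-step update and the answer-sampling scheme are illustrative assumptions, not the paper's recipe.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(image, texts):
    """CLIP image-text similarity, used as a scalar reward per candidate text."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    return clip(**inputs).logits_per_image.squeeze(0)   # shape: (num_texts,)

def tta_step(vlm_logits, image, candidate_answers, optimizer, num_samples=4):
    """One REINFORCE-style test-time update: sample answers from the VLM's
    answer distribution, score them with CLIP, and increase the log-probability
    of samples with above-average reward.  `vlm_logits` must come from a
    forward pass of the VLM whose parameters are in `optimizer`."""
    probs = torch.softmax(vlm_logits, dim=-1)                     # (num_answers,)
    idx = torch.multinomial(probs, num_samples, replacement=True)
    rewards = clip_reward(image, [candidate_answers[i] for i in idx.tolist()])
    rewards = rewards - rewards.mean()                            # simple baseline
    log_probs = torch.log(probs[idx] + 1e-8)
    loss = -(rewards * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```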