LXMERT Model Compression for Visual Question Answering
- URL: http://arxiv.org/abs/2310.15325v1
- Date: Mon, 23 Oct 2023 19:46:41 GMT
- Title: LXMERT Model Compression for Visual Question Answering
- Authors: Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi
- Abstract summary: We show that LXMERT can be effectively pruned by 40%-60% in size with a 3% loss in accuracy.
- Score: 0.03749861135832073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pretrained models such as LXMERT are becoming popular for
learning cross-modal representations on text-image pairs for vision-language
tasks. According to the lottery ticket hypothesis, NLP and computer vision
models contain smaller subnetworks capable of being trained in isolation to
full performance. In this paper, we combine these observations to evaluate
whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA
task. In addition, we perform a model size cost-benefit analysis by
investigating how much pruning can be done without significant loss in
accuracy. Our experimental results demonstrate that LXMERT can be effectively pruned by 40%-60% in size with a 3% loss in accuracy.
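The pruning setup is straightforward to sketch in PyTorch. Below is a minimal, illustrative example of one-shot global magnitude pruning over LXMERT's linear layers; the `unc-nlp/lxmert-vqa-uncased` checkpoint name, the 50% sparsity level, and the one-shot schedule are assumptions for illustration, not the paper's exact iterative lottery-ticket procedure, and the VQA fine-tuning loop is omitted.

```python
# Minimal sketch: global magnitude pruning of LXMERT's linear layers with
# torch.nn.utils.prune. The 50% sparsity sits inside the 40%-60% range reported
# in the abstract; checkpoint name and schedule are assumptions, not the paper's
# exact recipe.
import torch
import torch.nn.utils.prune as prune
from transformers import LxmertForQuestionAnswering

model = LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-vqa-uncased")

# Collect the weight tensors of every linear layer (the bulk of LXMERT's parameters).
to_prune = [
    (module, "weight")
    for module in model.modules()
    if isinstance(module, torch.nn.Linear)
]

# Remove the 50% of linear weights with the smallest magnitude, globally.
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.5)

# ... fine-tune (or rewind and retrain) on VQA here ...

# Make the pruning masks permanent and report the resulting sparsity.
for module, name in to_prune:
    prune.remove(module, name)

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")
```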
Related papers
- One-Shot Pruning for Fast-adapting Pre-trained Models on Devices [28.696989086706186]
Large-scale pre-trained models have been remarkably successful at resolving downstream tasks.
However, deploying these models on low-capability devices still requires an effective approach, such as model pruning.
We present a scalable one-shot pruning method that leverages pruned knowledge of similar tasks to extract a sub-network from the pre-trained model for a new task.
arXiv Detail & Related papers (2023-07-10T06:44:47Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
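As a rough illustration of the adapter idea, the sketch below maps a frozen embedding to a mean and a positive scale so that each embedding carries a per-dimension uncertainty estimate; the module name, hidden size, and Gaussian-style parameterization are assumptions, and ProbVLM's actual distribution family, training loss, and architecture are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticAdapter(nn.Module):
    """Hypothetical adapter: turns a frozen point embedding into (mean, scale)."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, dim)      # predicted distribution mean
        self.scale_head = nn.Linear(hidden, dim)   # predicted distribution scale

    def forward(self, frozen_embedding: torch.Tensor):
        h = self.backbone(frozen_embedding)
        mu = self.mu_head(h)
        scale = F.softplus(self.scale_head(h)) + 1e-6  # keep the scale strictly positive
        return mu, scale

# Usage: wrap frozen encoder outputs and read off per-dimension uncertainty,
# e.g. for calibration analysis in retrieval.
adapter = ProbabilisticAdapter(dim=512)
mu, scale = adapter(torch.randn(8, 512))
```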
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- Quantifying lottery tickets under label noise: accuracy, calibration, and complexity [6.232071870655069]
Pruning deep neural networks is a widely used strategy to alleviate the computational burden in machine learning.
We use the sparse double descent approach to unambiguously identify and characterise pruned models associated with classification tasks.
arXiv Detail & Related papers (2023-06-21T11:35:59Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism for training small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps [85.49020931411825]
Compression of Convolutional Neural Networks (CNNs) is crucial to deploying these models on edge devices with limited resources.
We propose to address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process.
We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models.
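A rough sketch of the idea, with the saliency scores replaced by a stand-in tensor: in the paper these scores come from an amortized selector model that predicts smooth saliency masks, whereas here the selector, the keep ratio, and the zero-masking strategy are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def prune_channels_by_saliency(conv: nn.Conv2d, saliency: torch.Tensor,
                               keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the output channels of `conv` with the lowest saliency scores."""
    assert saliency.numel() == conv.out_channels
    k = int(conv.out_channels * keep_ratio)
    keep = torch.topk(saliency, k).indices            # channels judged most important
    mask = torch.zeros(conv.out_channels, dtype=torch.bool)
    mask[keep] = True
    with torch.no_grad():
        conv.weight[~mask] = 0.0                      # prune whole output channels
        if conv.bias is not None:
            conv.bias[~mask] = 0.0
    return mask

# Example: prune half the channels of one layer given (stand-in) saliency scores.
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
scores = torch.rand(128)                              # would come from the selector model
kept = prune_channels_by_saliency(layer, scores, keep_ratio=0.5)
```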
arXiv Detail & Related papers (2022-09-07T01:12:11Z)
- A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models [50.27305012063483]
FewVLM is a few-shot prompt-based learner on vision-language tasks.
We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM).
We observe that prompts significantly affect zero-shot performance but marginally affect few-shot performance.
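To make the two objectives concrete, the toy sketch below builds (source, target) pairs for a sequence-to-sequence model in each style; the split ratio, mask rate, and `<mask>` token are assumptions, and FewVLM's exact span masking and special tokens are not reproduced.

```python
import random

MASK = "<mask>"

def prefix_lm_example(tokens: list[str], split_ratio: float = 0.5):
    """PrefixLM: the encoder sees a prefix, the decoder must generate the suffix."""
    cut = max(1, int(len(tokens) * split_ratio))
    return tokens[:cut], tokens[cut:]

def masked_lm_example(tokens: list[str], mask_prob: float = 0.15):
    """MaskedLM: random tokens are replaced with <mask>; the model recovers them."""
    source, target = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            source.append(MASK)
            target.append(tok)
        else:
            source.append(tok)
    return source, target

tokens = "a dog is catching a red frisbee in the park".split()
print(prefix_lm_example(tokens))   # prefix -> suffix continuation
print(masked_lm_example(tokens))   # masked input -> masked-out tokens
```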
arXiv Detail & Related papers (2021-10-16T06:07:59Z)
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models [35.644196343835674]
We propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities from word, phrase to sentence in both texts and images to pre-train the model in stages.
We also design several pre-training tasks suited to the information granularity of each stage, in order to efficiently capture the diverse knowledge available in a limited corpus.
Experimental results show that our method achieves comparable performance to the original LXMERT model on all downstream tasks, and even outperforms the original model on the Image-Text Retrieval task.
arXiv Detail & Related papers (2021-07-22T03:35:27Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy on par with models that use multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Playing Lottery Tickets with Vision and Language [62.6420670250559]
Large-scale transformer-based pre-training has revolutionized vision-and-language (V+L) research.
In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that can achieve performance on par with, or even better than, the dense networks when trained in isolation.
We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments.
arXiv Detail & Related papers (2021-04-23T22:24:33Z)
- Seeing past words: Testing the cross-modal capabilities of pretrained V&L models [18.73444918172383]
We investigate the ability of general-purpose pretrained vision and language (V&L) models to perform reasoning on two tasks that require multimodal integration.
We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT.
Our investigations suggest that pretrained V&L representations are less successful than expected at integrating the two modalities.
arXiv Detail & Related papers (2020-12-22T21:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.