BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile
Screenshot Captioning
- URL: http://arxiv.org/abs/2309.14774v1
- Date: Tue, 26 Sep 2023 09:16:44 GMT
- Title: BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile
Screenshot Captioning
- Authors: Ching-Yu Chiang, I-Hua Chang, Shih-Wei Liao
- Abstract summary: This study proposes a combination of adapter methods, which requires tuning only lightweight modules added to the model.
By freezing the parameters of the image captioning model and training only the weights of the added modules, performance comparable to fine-tuning the entire model can be achieved.
- Score: 0.5893124686141781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aims to explore efficient tuning methods for the screenshot
captioning task. Recently, image captioning has seen significant advancements,
but research in captioning tasks for mobile screens remains relatively scarce.
Current datasets and use cases describing user behaviors within product
screenshots are notably limited. Consequently, we sought to fine-tune
pre-existing models for the screenshot captioning task. However, fine-tuning
large pre-trained models can be resource-intensive, requiring considerable
time, computational power, and storage due to the vast number of parameters in
image captioning models. To tackle this challenge, this study proposes a
combination of adapter methods, which requires tuning only lightweight
modules added to the model. These methods were originally designed for vision
or language tasks, and our intention is to apply them to the similar
challenges posed by screenshot captioning. By freezing the parameters of the
image captioning model and training only the weights of the added modules,
performance comparable to fine-tuning the entire model can be achieved, while
significantly reducing the number of trainable parameters. This study represents the
first comprehensive investigation into the effectiveness of combining adapters
within the context of the screenshot captioning task. Through our experiments
and analyses, this study aims to provide valuable insights into the application
of adapters in vision-language models and contribute to the development of
efficient tuning techniques for the screenshot captioning task. Our study is
available at https://github.com/RainYuGG/BLIP-Adapter
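To make the tuning recipe concrete, the sketch below (not the authors' exact implementation) freezes a Hugging Face BLIP captioning model and attaches small bottleneck adapters to its vision encoder via forward hooks, so that only the adapter weights receive gradients. The module path `model.vision_model.encoder.layers`, the bottleneck size, and the hook-based insertion are illustrative assumptions.
```python
# Minimal sketch of adapter-style parameter-efficient tuning, assuming the
# Hugging Face BLIP implementation; module paths and hyperparameters are
# illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn
from transformers import BlipForConditionalGeneration


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

# Freeze every pre-trained weight; only the adapters will be trained.
for param in model.parameters():
    param.requires_grad = False

# Attach one adapter after each vision-encoder layer with a forward hook,
# leaving the backbone code untouched (the layer path is an assumption).
hidden_size = model.config.vision_config.hidden_size
vision_layers = model.vision_model.encoder.layers
adapters = nn.ModuleList(Adapter(hidden_size) for _ in vision_layers)


def make_hook(adapter: Adapter):
    def hook(module, inputs, output):
        # BLIP encoder layers return a tuple whose first item is the hidden
        # states; pass those through the adapter and keep the rest as-is.
        if isinstance(output, tuple):
            return (adapter(output[0]),) + output[1:]
        return adapter(output)

    return hook


for layer, adapter in zip(vision_layers, adapters):
    layer.register_forward_hook(make_hook(adapter))

# Only the adapter parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in adapters.parameters())
frozen = sum(p.numel() for p in model.parameters())
print(f"trainable adapter params: {trainable:,} vs frozen backbone: {frozen:,}")
```
Under these assumed sizes (hidden size 768, bottleneck 64), each adapter adds roughly 0.1M parameters per layer, which is where the reported reduction in trainable parameters comes from.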
Related papers
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization for tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Better Captioning with Sequence-Level Exploration [60.57850194028581]
We show the limitations of the current sequence-level learning objective for captioning tasks.
In theory, we show that this objective is equivalent to optimizing only the precision side of the caption set.
Empirical results show that models trained with this objective tend to get lower scores on the recall side.
arXiv Detail & Related papers (2020-03-08T09:08:03Z)