Contrastive Alignment of Vision to Language Through Parameter-Efficient
Transfer Learning
- URL: http://arxiv.org/abs/2303.11866v1
- Date: Tue, 21 Mar 2023 14:12:08 GMT
- Title: Contrastive Alignment of Vision to Language Through Parameter-Efficient
Transfer Learning
- Authors: Zaid Khan and Yun Fu
- Abstract summary: Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
- Score: 60.26952378997713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive vision-language models (e.g. CLIP) are typically created by
updating all the parameters of a vision model and language model through
contrastive training. Can such models be created by a small number of parameter
updates to an already-trained language model and vision model? The literature
describes techniques that can create vision-language models by updating a small
number of parameters in a language model, but these require already aligned
visual representations and are non-contrastive, hence unusable for
latency-sensitive applications such as neural search. We explore the
feasibility and benefits of parameter-efficient contrastive vision-language
alignment through transfer learning: creating a model such as CLIP by minimally
updating an already-trained vision and language model. We find that a minimal
set of parameter updates (<7%) can achieve the same performance as full-model
training, and updating specific components (<1% of parameters) can match 75%
of full-model training. We describe a series of experiments: we show that
existing knowledge is conserved more strongly in parameter-efficient training
and that parameter-efficient scaling scales with model and dataset size. Where
paired image-text data is scarce but strong multilingual language models exist
(e.g. low resource languages), parameter-efficient training is even preferable
to full-model training. Given a fixed compute budget, parameter-efficient
training allows training larger models on the same hardware, achieving
equivalent performance in less time. Parameter-efficient training hence
constitutes an energy-efficient and effective training strategy for contrastive
vision-language models that may be preferable to the full-model training
paradigm for common use cases. Code and weights at
https://github.com/codezakh/LilT.
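The following is a minimal sketch, not the authors' exact LilT recipe, of what parameter-efficient contrastive alignment can look like: freeze two pretrained towers, re-enable only a small subset of parameters (here, the LayerNorm weights plus new projection heads, an illustrative choice), and train with the standard CLIP-style symmetric InfoNCE loss. The encoder interfaces and dimensions are assumptions for illustration.

```python
# Hypothetical sketch of parameter-efficient contrastive (CLIP-style) alignment.
# The encoder classes, the choice of LayerNorms + projection heads as the
# trainable subset, and all dimensions are illustrative assumptions,
# not the exact LilT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module,
                 vision_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder  # assumed to return [B, vision_dim]
        self.text_encoder = text_encoder      # assumed to return [B, text_dim]
        # Small trainable projection heads on top of the frozen towers.
        self.vision_proj = nn.Linear(vision_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)

        # Freeze everything, then re-enable only a tiny subset of parameters.
        for p in self.parameters():
            p.requires_grad = False
        for tower in (self.vision_encoder, self.text_encoder):
            for m in tower.modules():
                if isinstance(m, nn.LayerNorm):  # illustrative trainable subset
                    for p in m.parameters():
                        p.requires_grad = True
        for p in list(self.vision_proj.parameters()) + list(self.text_proj.parameters()):
            p.requires_grad = True
        self.logit_scale.requires_grad = True

    def forward(self, images, tokens):
        img = F.normalize(self.vision_proj(self.vision_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(tokens)), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(len(img), device=img.device)
        # Symmetric InfoNCE: image-to-text and text-to-image cross-entropy.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```

In such a setup, only the parameters left with requires_grad=True are handed to the optimizer, e.g. torch.optim.AdamW(p for p in model.parameters() if p.requires_grad), so the optimizer state and gradient memory scale with the small trainable subset rather than the full model.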
Related papers
- Cross-model Control: Improving Multiple Large Language Models in One-time Training [34.98931804630706]
Cross-model Control (CMC) is a method that improves multiple large language models in one-time training.
Based on this insight, we incorporate a tiny language model with a minimal number of parameters.
We propose a novel token mapping strategy named PM-MinED to make this tiny language model applicable to models with different vocabularies.
arXiv Detail & Related papers (2024-10-23T06:52:09Z) - ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z) - On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based
Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z) - FILM: How can Few-Shot Image Classification Benefit from Pre-Trained
Language Models? [14.582209994281374]
Few-shot learning aims to train models that can be generalized to novel classes with only a few samples.
We propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning.
arXiv Detail & Related papers (2023-07-09T08:07:43Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose directing effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Multitask Learning for Low Resource Spoken Language Understanding [26.106133114838215]
We train models on dual objectives with automatic speech recognition and intent classification or sentiment classification.
Our models, although being of modest size, show improvements over models trained end-to-end on intent classification.
We study the performance of the models in low-resource scenario by training the models with as few as one example per class.
arXiv Detail & Related papers (2022-11-24T16:38:17Z) - Zero-Shot Learners for Natural Language Understanding via a Unified
Multiple Choice Perspective [26.41585967095811]
Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training.
Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN.
Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification.
arXiv Detail & Related papers (2022-10-16T17:24:06Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Exploring Versatile Generative Language Model Via Parameter-Efficient
Transfer Learning [70.81910984985683]
We propose an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pre-trained model.
Experiments on five diverse language generation tasks show that, using only an additional 2-3% of parameters per task, our model can match or even improve on the performance of fine-tuning the whole model.
arXiv Detail & Related papers (2020-04-08T06:18:44Z)
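As referenced in the eP-ALM summary above, here is a rough sketch, under assumed interfaces (not the eP-ALM implementation), of the "frozen LM plus one linear projection plus one prepended trainable token" idea: both pretrained models stay frozen, and only a single projection and a single soft token are learned.

```python
# Illustrative sketch of augmenting a frozen language model with perception via
# one trainable linear projection and one prepended soft token, as described in
# the eP-ALM summary above. Encoder/LM interfaces and dimensions are assumptions.
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    def __init__(self, frozen_lm: nn.Module, frozen_vision: nn.Module,
                 vision_dim: int, lm_dim: int):
        super().__init__()
        self.lm = frozen_lm          # assumed to consume input embeddings directly
        self.vision = frozen_vision  # assumed to return [B, vision_dim]
        for p in self.lm.parameters():
            p.requires_grad = False
        for p in self.vision.parameters():
            p.requires_grad = False
        # The only trainable pieces: one projection and one soft token.
        self.proj = nn.Linear(vision_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, images, token_embeds):
        # Project the frozen visual feature into the LM embedding space and
        # prepend it, together with the learned soft token, to the text embeddings.
        vis = self.proj(self.vision(images)).unsqueeze(1)    # [B, 1, lm_dim]
        soft = self.soft_token.expand(len(images), -1, -1)   # [B, 1, lm_dim]
        inputs = torch.cat([soft, vis, token_embeds], dim=1)
        return self.lm(inputs)  # assumes the LM accepts embedded inputs
```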