LiT: Zero-Shot Transfer with Locked-image Text Tuning
- URL: http://arxiv.org/abs/2111.07991v1
- Date: Mon, 15 Nov 2021 18:53:48 GMT
- Title: LiT: Zero-Shot Transfer with Locked-image Text Tuning
- Authors: Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel
Keysers, Alexander Kolesnikov, Lucas Beyer
- Abstract summary: "Locked-image Text tuning" (LiT-tuning) teaches a text model to read out good representations from a pre-trained image model for new tasks.
A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval.
- Score: 68.78877201319811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents contrastive-tuning, a simple method employing contrastive
training to align image and text models while still taking advantage of their
pre-training. In our empirical study we find that locked pre-trained image
models with unlocked text models work best. We call this instance of
contrastive-tuning "Locked-image Text tuning" (LiT-tuning), which just teaches
a text model to read out good representations from a pre-trained image model
for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to
new vision tasks, such as image classification or retrieval. The proposed
LiT-tuning is widely applicable; it works reliably with multiple pre-training
methods (supervised and unsupervised) and across diverse architectures (ResNet,
Vision Transformers and MLP-Mixer) using three different image-text datasets.
With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned model
achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1%
on the challenging out-of-distribution ObjectNet test set.
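A minimal sketch of the recipe the abstract describes, written as PyTorch-style Python (this is not the authors' implementation; the encoder modules, embedding shapes, and the zero-shot helper below are assumptions for illustration): the pre-trained image tower is frozen ("locked"), only the text tower and a learned temperature are trained with the standard symmetric contrastive loss, and zero-shot classification afterwards simply scores images against embedded class-name prompts.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LiTStyleModel(nn.Module):
    """Contrastive tuning with a locked (frozen) image tower and a trainable text tower."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 init_temperature: float = 0.07):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():   # lock the pre-trained image tower
            p.requires_grad_(False)
        self.image_encoder.eval()                   # keep norm/dropout layers deterministic
        self.text_encoder = text_encoder            # unlocked: trained by the contrastive loss
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / init_temperature)))

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                       # no gradients flow into the image tower
            img_emb = self.image_encoder(images)
        txt_emb = self.text_encoder(token_ids)
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric image-to-text and text-to-image contrastive (InfoNCE) loss.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


def zero_shot_classify(model: LiTStyleModel, images: torch.Tensor,
                       class_prompt_ids: torch.Tensor) -> torch.Tensor:
    """Score images against text embeddings of class-name prompts; return best class index."""
    with torch.no_grad():
        img_emb = F.normalize(model.image_encoder(images), dim=-1)
        txt_emb = F.normalize(model.text_encoder(class_prompt_ids), dim=-1)
        return (img_emb @ txt_emb.t()).argmax(dim=-1)
```
In this setup only text_encoder and logit_scale receive gradients, which is what lets the text model learn to "read out" the fixed image representations, matching the abstract's description of LiT-tuning.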
Related papers
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval [17.70430913227593]
We introduce a novel unlabeled and pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task.
With such a simple design, it can learn to capture fine-grained text-guided modifications.
arXiv Detail & Related papers (2023-11-13T02:49:57Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching [49.730741713652435]
In this paper, we propose a method that can effectively transfer the representations of a large pre-trained multimodal model into a small target model.
For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model (a generic sketch of this idea appears after this list).
To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts.
arXiv Detail & Related papers (2023-01-07T17:24:11Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation models on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [93.11954811297652]
We design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads.
We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals.
Experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks.
arXiv Detail & Related papers (2021-12-14T00:20:55Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)
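Several of these related methods, like LiT itself, operate in a shared image-text embedding space. As a rough illustration of the cross-modal similarity matching idea from the CSM entry above, the sketch below trains a student image encoder so that its similarity distribution over a set of text-prompt embeddings matches the teacher's; the function name, the KL-divergence form, and the assumption of a common embedding dimension are all illustrative choices, not that paper's exact objective.
```python
import torch
import torch.nn.functional as F


def cross_modal_similarity_matching(student_img: torch.Tensor,
                                    teacher_img: torch.Tensor,
                                    prompt_txt: torch.Tensor,
                                    tau: float = 0.1) -> torch.Tensor:
    """Generic similarity-matching distillation loss (illustrative only).

    student_img, teacher_img: [batch, d] image embeddings, assumed projected
    into a common dimension d; prompt_txt: [num_prompts, d] text-prompt
    embeddings from the teacher's text tower. Teacher-side tensors are assumed
    to be detached (no gradients) by the caller.
    """
    s = F.normalize(student_img, dim=-1) @ F.normalize(prompt_txt, dim=-1).t()
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(prompt_txt, dim=-1).t()
    # Push the student's similarity distribution over the prompts toward the teacher's.
    return F.kl_div(F.log_softmax(s / tau, dim=-1),
                    F.softmax(t / tau, dim=-1),
                    reduction="batchmean")
```
In practice such a distillation term would be combined with whatever labeled or unlabeled data is available for the target model.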