Towards a Unified Foundation Model: Jointly Pre-Training Transformers on
Unpaired Images and Text
- URL: http://arxiv.org/abs/2112.07074v1
- Date: Tue, 14 Dec 2021 00:20:55 GMT
- Title: Towards a Unified Foundation Model: Jointly Pre-Training Transformers on
Unpaired Images and Text
- Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan
Yang, Matthew Brown
- Abstract summary: We design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads.
We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals.
Experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks.
- Score: 93.11954811297652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the possibility of building a unified foundation
model that can be adapted to both vision-only and text-only tasks. Starting
from BERT and ViT, we design a unified transformer consisting of
modality-specific tokenizers, a shared transformer encoder, and task-specific
output heads. To efficiently pre-train the proposed model jointly on unpaired
images and text, we propose two novel techniques: (i) We employ the
separately-trained BERT and ViT models as teachers and apply knowledge
distillation to provide additional, accurate supervision signals for the joint
training; (ii) We propose a novel gradient masking strategy to balance the
parameter updates from the image and text pre-training losses. We evaluate the
jointly pre-trained transformer by fine-tuning it on image classification tasks
and natural language understanding tasks, respectively. The experiments show
that the resultant unified foundation transformer works surprisingly well on
both the vision-only and text-only tasks, and the proposed knowledge
distillation and gradient masking strategy can effectively lift the performance
to approach the level of separately-trained models.
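Below is a minimal PyTorch-style sketch of the three ingredients described in the abstract: modality-specific tokenizers with a shared encoder and task heads, knowledge distillation from frozen BERT/ViT teachers, and a per-parameter gradient mask. The module sizes, the teacher callables (`bert_teacher`, `vit_teacher`), and the random masking rule are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch: modality-specific tokenizers + shared encoder + task heads,
# distillation from frozen teachers, and a per-parameter gradient mask.
# Sizes, teacher callables, and the masking rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTransformer(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=3 * 16 * 16,
                 d_model=768, n_heads=12, n_layers=12, num_classes=1000):
        super().__init__()
        # Modality-specific tokenizers: word embeddings for text,
        # a ViT-style linear patch projection for images.
        self.text_tokenizer = nn.Embedding(vocab_size, d_model)
        self.image_tokenizer = nn.Linear(patch_dim, d_model)
        # Shared transformer encoder used by both modalities.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Task-specific output heads (positional/[CLS] embeddings omitted).
        self.mlm_head = nn.Linear(d_model, vocab_size)   # masked language modeling
        self.cls_head = nn.Linear(d_model, num_classes)  # image classification

    def forward_text(self, token_ids):
        return self.mlm_head(self.encoder(self.text_tokenizer(token_ids)))

    def forward_image(self, patches):
        h = self.encoder(self.image_tokenizer(patches))
        return self.cls_head(h.mean(dim=1))  # simple pooled representation


def joint_step(model, optimizer, text_batch, image_patches,
               bert_teacher, vit_teacher, temperature=1.0, mask_ratio=0.5):
    """One joint pre-training step: distillation losses from frozen teachers,
    then a random per-parameter gradient mask (a stand-in for the paper's
    gradient masking strategy) to balance the two updates."""
    params = list(model.parameters())
    optimizer.zero_grad()

    def kd_loss(student_logits, teacher_logits):
        return F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                        F.softmax(teacher_logits / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2

    # Text branch, distilled from a frozen BERT teacher (assumed to return logits).
    # The original MLM pre-training loss would be added alongside this term.
    with torch.no_grad():
        bert_logits = bert_teacher(text_batch)
    text_loss = kd_loss(model.forward_text(text_batch), bert_logits)
    text_grads = torch.autograd.grad(text_loss, params, allow_unused=True)

    # Image branch, distilled from a frozen ViT teacher (assumed to take patches).
    with torch.no_grad():
        vit_logits = vit_teacher(image_patches)
    image_loss = kd_loss(model.forward_image(image_patches), vit_logits)
    image_grads = torch.autograd.grad(image_loss, params, allow_unused=True)

    # Gradient masking: each parameter entry is updated by either the text or
    # the image gradient, so neither pre-training loss dominates the shared weights.
    for p, g_t, g_i in zip(params, text_grads, image_grads):
        g_t = torch.zeros_like(p) if g_t is None else g_t
        g_i = torch.zeros_like(p) if g_i is None else g_i
        mask = (torch.rand_like(p) < mask_ratio).float()
        p.grad = mask * g_t + (1.0 - mask) * g_i
    optimizer.step()
    return text_loss.item(), image_loss.item()
```

Per the abstract, the distillation terms supplement the image and text pre-training losses and the masking strategy is designed to balance the two updates; the random mask here only illustrates the mechanism.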
Related papers
- Image Generation from Image Captioning -- Invertible Approach [0.0]
We train an invertible model that learns a one-to-one mapping between the image and text embeddings.
Once the invertible model is efficiently trained on one task (image captioning), the same model can generate new images for a given text.
arXiv Detail & Related papers (2024-10-26T13:02:58Z)
- Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation [25.253522756863727]
We propose Instruct-IPT -- an All-in-One Image Processing Transformer that can effectively address a wide range of image restoration tasks.
We identify task-sensitive weights via a toy experiment and introduce task-specific biases on top of them.
We conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases.
arXiv Detail & Related papers (2024-06-30T12:13:34Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates fine-tuning ViTs on few-shot learning tasks.
The key novelties are the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Pre-training image-language transformers for open-vocabulary tasks [53.446599611203474]
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, and object-aware strategies to pre-train the model.
We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.
arXiv Detail & Related papers (2022-09-09T16:11:11Z)
- Image and Model Transformation with Secret Key for Vision Transformer [16.055655429920993]
We show for the first time that models trained with plain images can be directly transformed into models trained with encrypted images.
The performance of the transformed models is the same as that of models trained with plain images when the test images are encrypted with the key.
arXiv Detail & Related papers (2022-07-12T08:02:47Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- End-to-End Visual Editing with a Generatively Pre-Trained Artist [78.5922562526874]
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change.
We propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain.
We show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture.
arXiv Detail & Related papers (2022-05-03T17:59:30Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and tails (one per task); a simplified sketch of this multi-head/multi-tail layout follows this entry.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
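As referenced in the entry above, here is a simplified, hypothetical sketch of the multi-head/multi-tail layout: one convolutional head and tail per restoration task around a shared transformer body. The task list, layer sizes, and the encoder-only body are assumptions for illustration, not the published IPT configuration.

```python
# A simplified, hypothetical multi-head / multi-tail image-processing
# transformer in the spirit of IPT: one head and one tail per task around
# a shared body. Sizes and tasks are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskRestorer(nn.Module):
    def __init__(self, tasks=("denoise", "derain", "sr"), dim=64, patch=4,
                 depth=4, n_heads=8):
        super().__init__()
        self.dim, self.patch = dim, patch
        # One convolutional head per task: corrupted image -> features.
        self.heads = nn.ModuleDict(
            {t: nn.Conv2d(3, dim, kernel_size=3, padding=1) for t in tasks})
        # Shared transformer body over flattened feature patches
        # (an encoder-only stand-in for IPT's encoder-decoder body).
        layer = nn.TransformerEncoderLayer(dim * patch * patch, n_heads,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, depth)
        # One convolutional tail per task: features -> restored image.
        self.tails = nn.ModuleDict(
            {t: nn.Conv2d(dim, 3, kernel_size=3, padding=1) for t in tasks})

    def forward(self, x, task):
        b, _, h, w = x.shape            # h and w must be divisible by `patch`
        p, d = self.patch, self.dim
        f = self.heads[task](x)                                  # (b, d, h, w)
        # Split the feature map into non-overlapping p x p patches (tokens).
        t = f.unfold(2, p, p).unfold(3, p, p)                    # (b, d, h/p, w/p, p, p)
        t = t.reshape(b, d, -1, p * p).permute(0, 2, 1, 3)       # (b, N, d, p*p)
        t = self.body(t.reshape(b, -1, d * p * p))               # (b, N, d*p*p)
        # Fold tokens back into a feature map and decode with the task tail.
        f = t.reshape(b, h // p, w // p, d, p, p)
        f = f.permute(0, 3, 1, 4, 2, 5).reshape(b, d, h, w)
        return self.tails[task](f)

# Example: restore a batch of synthetically corrupted 64x64 crops.
model = MultiTaskRestorer()
restored = model(torch.randn(2, 3, 64, 64), task="denoise")
```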
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.