Towards a Unified Foundation Model: Jointly Pre-Training Transformers on
Unpaired Images and Text
- URL: http://arxiv.org/abs/2112.07074v1
- Date: Tue, 14 Dec 2021 00:20:55 GMT
- Title: Towards a Unified Foundation Model: Jointly Pre-Training Transformers on
Unpaired Images and Text
- Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan
Yang, Matthew Brown
- Abstract summary: We design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads.
We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals.
Experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks.
- Score: 93.11954811297652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the possibility of building a unified foundation
model that can be adapted to both vision-only and text-only tasks. Starting
from BERT and ViT, we design a unified transformer consisting of
modality-specific tokenizers, a shared transformer encoder, and task-specific
output heads. To efficiently pre-train the proposed model jointly on unpaired
images and text, we propose two novel techniques: (i) We employ the
separately-trained BERT and ViT models as teachers and apply knowledge
distillation to provide additional, accurate supervision signals for the joint
training; (ii) We propose a novel gradient masking strategy to balance the
parameter updates from the image and text pre-training losses. We evaluate the
jointly pre-trained transformer by fine-tuning it on image classification tasks
and natural language understanding tasks, respectively. The experiments show
that the resultant unified foundation transformer works surprisingly well on
both the vision-only and text-only tasks, and the proposed knowledge
distillation and gradient masking strategy can effectively lift the performance
to approach the level of separately-trained models.
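Below is a minimal PyTorch-style sketch of the three ingredients described in the abstract: modality-specific tokenizers with a shared encoder and task heads, knowledge distillation from frozen BERT/ViT teachers, and a per-parameter gradient mask. The module sizes, the teacher callables (`bert_teacher`, `vit_teacher`), and the random masking rule are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch: modality-specific tokenizers + shared encoder + task heads,
# distillation from frozen teachers, and a per-parameter gradient mask.
# Sizes, teacher callables, and the masking rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTransformer(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=3 * 16 * 16,
                 d_model=768, n_heads=12, n_layers=12, num_classes=1000):
        super().__init__()
        # Modality-specific tokenizers: word embeddings for text,
        # a ViT-style linear patch projection for images.
        self.text_tokenizer = nn.Embedding(vocab_size, d_model)
        self.image_tokenizer = nn.Linear(patch_dim, d_model)
        # Shared transformer encoder used by both modalities.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Task-specific output heads (positional/[CLS] embeddings omitted).
        self.mlm_head = nn.Linear(d_model, vocab_size)   # masked language modeling
        self.cls_head = nn.Linear(d_model, num_classes)  # image classification

    def forward_text(self, token_ids):
        return self.mlm_head(self.encoder(self.text_tokenizer(token_ids)))

    def forward_image(self, patches):
        h = self.encoder(self.image_tokenizer(patches))
        return self.cls_head(h.mean(dim=1))  # simple pooled representation


def joint_step(model, optimizer, text_batch, image_patches,
               bert_teacher, vit_teacher, temperature=1.0, mask_ratio=0.5):
    """One joint pre-training step: distillation losses from frozen teachers,
    then a random per-parameter gradient mask (a stand-in for the paper's
    gradient masking strategy) to balance the two updates."""
    params = list(model.parameters())
    optimizer.zero_grad()

    def kd_loss(student_logits, teacher_logits):
        return F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                        F.softmax(teacher_logits / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2

    # Text branch, distilled from a frozen BERT teacher (assumed to return logits).
    # The original MLM pre-training loss would be added alongside this term.
    with torch.no_grad():
        bert_logits = bert_teacher(text_batch)
    text_loss = kd_loss(model.forward_text(text_batch), bert_logits)
    text_grads = torch.autograd.grad(text_loss, params, allow_unused=True)

    # Image branch, distilled from a frozen ViT teacher (assumed to take patches).
    with torch.no_grad():
        vit_logits = vit_teacher(image_patches)
    image_loss = kd_loss(model.forward_image(image_patches), vit_logits)
    image_grads = torch.autograd.grad(image_loss, params, allow_unused=True)

    # Gradient masking: each parameter entry is updated by either the text or
    # the image gradient, so neither pre-training loss dominates the shared weights.
    for p, g_t, g_i in zip(params, text_grads, image_grads):
        g_t = torch.zeros_like(p) if g_t is None else g_t
        g_i = torch.zeros_like(p) if g_i is None else g_i
        mask = (torch.rand_like(p) < mask_ratio).float()
        p.grad = mask * g_t + (1.0 - mask) * g_i
    optimizer.step()
    return text_loss.item(), image_loss.item()
```

Per the abstract, the distillation terms supplement the image and text pre-training losses and the masking strategy is designed to balance the two updates; the random mask here only illustrates the mechanism.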
Related papers
- Image Generation from Image Captioning -- Invertible Approach [0.0]
We train an invertible model that learns a one-to-one mapping between the image and text embeddings.
Once the invertible model is efficiently trained on one task (image captioning), the same model can generate new images for a given text.
arXiv Detail & Related papers (2024-10-26T13:02:58Z)
- Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation [25.253522756863727]
We propose Instruct-IPT -- an All-in-One Image Processing Transformer that can effectively address a wide range of image restoration tasks.
We identify task-sensitive weights via a toy experiment and introduce task-specific biases on top of them.
We conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases.
arXiv Detail & Related papers (2024-06-30T12:13:34Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates fine-tuning ViTs on few-shot learning tasks.
The key novelties are the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Pre-training image-language transformers for open-vocabulary tasks [53.446599611203474]
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, and object-aware strategies to pre-train the model.
We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.
arXiv Detail & Related papers (2022-09-09T16:11:11Z)
- Image and Model Transformation with Secret Key for Vision Transformer [16.055655429920993]
We show for the first time that models trained with plain images can be directly transformed into models trained with encrypted images.
The performance of the transformed models is the same as that of models trained with plain images when the test images are encrypted with the key.
arXiv Detail & Related papers (2022-07-12T08:02:47Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- End-to-End Visual Editing with a Generatively Pre-Trained Artist [78.5922562526874]
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change.
We propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain.
We show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture.
arXiv Detail & Related papers (2022-05-03T17:59:30Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and tails (one per task); a simplified sketch of this multi-head/multi-tail layout follows this entry.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
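As referenced in the entry above, here is a simplified, hypothetical sketch of the multi-head/multi-tail layout: one convolutional head and tail per restoration task around a shared transformer body. The task list, layer sizes, and the encoder-only body are assumptions for illustration, not the published IPT configuration.

```python
# A simplified, hypothetical multi-head / multi-tail image-processing
# transformer in the spirit of IPT: one head and one tail per task around
# a shared body. Sizes and tasks are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskRestorer(nn.Module):
    def __init__(self, tasks=("denoise", "derain", "sr"), dim=64, patch=4,
                 depth=4, n_heads=8):
        super().__init__()
        self.dim, self.patch = dim, patch
        # One convolutional head per task: corrupted image -> features.
        self.heads = nn.ModuleDict(
            {t: nn.Conv2d(3, dim, kernel_size=3, padding=1) for t in tasks})
        # Shared transformer body over flattened feature patches
        # (an encoder-only stand-in for IPT's encoder-decoder body).
        layer = nn.TransformerEncoderLayer(dim * patch * patch, n_heads,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, depth)
        # One convolutional tail per task: features -> restored image.
        self.tails = nn.ModuleDict(
            {t: nn.Conv2d(dim, 3, kernel_size=3, padding=1) for t in tasks})

    def forward(self, x, task):
        b, _, h, w = x.shape            # h and w must be divisible by `patch`
        p, d = self.patch, self.dim
        f = self.heads[task](x)                                  # (b, d, h, w)
        # Split the feature map into non-overlapping p x p patches (tokens).
        t = f.unfold(2, p, p).unfold(3, p, p)                    # (b, d, h/p, w/p, p, p)
        t = t.reshape(b, d, -1, p * p).permute(0, 2, 1, 3)       # (b, N, d, p*p)
        t = self.body(t.reshape(b, -1, d * p * p))               # (b, N, d*p*p)
        # Fold tokens back into a feature map and decode with the task tail.
        f = t.reshape(b, h // p, w // p, d, p, p)
        f = f.permute(0, 3, 1, 4, 2, 5).reshape(b, d, h, w)
        return self.tails[task](f)

# Example: restore a batch of synthetically corrupted 64x64 crops.
model = MultiTaskRestorer()
restored = model(torch.randn(2, 3, 64, 64), task="denoise")
```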
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.