Self-Training Vision Language BERTs with a Unified Conditional Model
- URL: http://arxiv.org/abs/2201.02010v1
- Date: Thu, 6 Jan 2022 11:00:52 GMT
- Title: Self-Training Vision Language BERTs with a Unified Conditional Model
- Authors: Xiaofeng Yang, Fengmao Lv, Fayao Liu, Guosheng Lin
- Abstract summary: We propose a self-training approach that allows training VL-BERTs from unlabeled image data.
We use the labeled image data to train a teacher model and use the trained model to generate pseudo captions on unlabeled image data.
By using the proposed self-training approach and only 300k extra unlabeled images, we achieve competitive or even better performance.
- Score: 51.11025371762571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language BERTs are trained on language corpora in a self-supervised
manner. Unlike natural language BERTs, vision language BERTs need paired data
to train, which restricts the scale of VL-BERT pretraining. We propose a
self-training approach that allows training VL-BERTs from unlabeled image data.
The proposed method starts with our unified conditional model -- a vision
language BERT model that can perform zero-shot conditional generation. Given
different conditions, the unified conditional model can generate captions,
dense captions, and even questions. We use the labeled image data to train a
teacher model and use the trained model to generate pseudo captions on
unlabeled image data. We then combine the labeled data and pseudo labeled data
to train a student model. The process is iterated by putting the student model
as a new teacher. By using the proposed self-training approach and only 300k
extra unlabeled images, we achieve performance competitive with or even better
than that of models of similar size trained with 3 million extra images.
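The iterative teacher-student procedure described above can be summarized in a short sketch. The snippet below is a minimal illustration, assuming hypothetical helpers train_vl_bert and generate_captions that stand in for the paper's training and conditional caption-generation steps; it is not the authors' released implementation.

```python
# Minimal sketch of the iterative teacher-student self-training loop from the
# abstract. train_vl_bert and generate_captions are hypothetical placeholders
# for the paper's actual training and conditional generation procedures.

def self_train(labeled_data, unlabeled_images, num_rounds=3):
    # Train the initial teacher on labeled image-caption pairs only.
    teacher = train_vl_bert(labeled_data)

    for _ in range(num_rounds):
        # Use the teacher to generate pseudo captions for the unlabeled images.
        pseudo_data = [(image, generate_captions(teacher, image))
                       for image in unlabeled_images]

        # Train a student on the union of labeled and pseudo-labeled data.
        student = train_vl_bert(labeled_data + pseudo_data)

        # The student becomes the teacher for the next round.
        teacher = student

    return teacher
```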
Related papers
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking
BERT Sentence Representations for Hindi and Marathi [0.7874708385247353]
This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation.
We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi.
arXiv Detail & Related papers (2022-11-21T05:15:48Z) - A Fistful of Words: Learning Transferable Visual Models from
Bag-of-Words Supervision [32.4697157553247]
In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset.
Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption.
arXiv Detail & Related papers (2021-12-27T20:02:10Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work that trains text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results on standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pre-training (VIVO), which performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z) - lamBERT: Language and Action Learning Using Multimodal BERT [0.1942428068361014]
This study proposes a language and action learning model using multimodal BERT (lamBERT).
Experiments are conducted in a grid environment that requires language understanding for the agent to act properly.
The lamBERT model obtained higher rewards in multitask settings and transfer settings when compared to other models.
arXiv Detail & Related papers (2020-04-15T13:54:55Z) - What the [MASK]? Making Sense of Language-Specific BERT Models [39.54532211263058]
This paper presents the current state of the art in language-specific BERT models.
Our aim is to provide an overview of the commonalities and differences between language-specific BERT models and mBERT models.
arXiv Detail & Related papers (2020-03-05T20:42:51Z)