Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment
- URL: http://arxiv.org/abs/2302.00902v2
- Date: Fri, 3 Feb 2023 05:06:46 GMT
- Title: Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment
- Authors: Hao Liu, Wilson Yan, Pieter Abbeel
- Abstract summary: Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
- Score: 81.73717488887938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in scaling up large language models has shown impressive
capabilities in performing few-shot learning across a wide range of text-based
tasks. However, a key limitation is that these language models fundamentally
lack visual perception - a crucial attribute needed to extend these models to
be able to interact with the real world and solve vision tasks, such as in
visual-question answering and robotics. Prior works have largely connected
images to text through pretraining and/or fine-tuning on curated image-text
datasets, which can be a costly process. To address this
limitation, we propose a simple yet effective approach called
Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to
align text-image data in an unsupervised manner by leveraging pretrained
language models (e.g., BERT, RoBERTa). Our main idea is to encode images as
sequences of text tokens by directly quantizing image embeddings using a
pretrained language codebook. We then apply random masking followed by a BERT
model, and have the decoder reconstruct the original image from BERT-predicted
text token embeddings. By doing so, LQAE learns to represent similar images
with similar clusters of text tokens, thereby aligning these two modalities
without the use of aligned text-image pairs. This enables few-shot image
classification with large language models (e.g., GPT-3) as well as linear
classification of images based on BERT text features. To the best of our
knowledge, ours is the first work to use unaligned images for multimodal
tasks by leveraging the power of pretrained language models.
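
As a minimal sketch of the quantization step described above (not the authors' released code): the snippet below assumes image patches have already been encoded and projected to the width of a frozen RoBERTa token-embedding table, and it shows only the nearest-neighbour lookup with a straight-through gradient estimator; the image encoder, random masking, BERT denoising, and pixel decoder stages of LQAE are omitted. The model choice and the helper name language_quantize are illustrative.

import torch
from transformers import AutoModel, AutoTokenizer

lm_name = "roberta-base"  # assumed pretrained LM supplying the frozen codebook
tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModel.from_pretrained(lm_name)
codebook = lm.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim), kept frozen

def language_quantize(patch_embeddings):
    """Map each image-patch embedding to its nearest LM token embedding.

    patch_embeddings: (batch, num_patches, hidden_dim) tensor from an image encoder.
    Returns quantized embeddings (with straight-through gradients) and the token ids,
    which read as a sequence of "words" describing the image.
    """
    flat = patch_embeddings.reshape(-1, patch_embeddings.shape[-1])
    # Squared L2 distance from every patch embedding to every codebook entry.
    dists = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ codebook.T
             + codebook.pow(2).sum(1))
    token_ids = dists.argmin(dim=1)                       # nearest-neighbour lookup
    quantized = codebook[token_ids].reshape(patch_embeddings.shape)
    # Straight-through estimator so gradients still reach the image encoder.
    quantized = patch_embeddings + (quantized - patch_embeddings).detach()
    return quantized, token_ids.reshape(patch_embeddings.shape[:-1])

# Toy usage: 4 images, 196 patches each, already projected to the LM width.
patches = torch.randn(4, 196, codebook.shape[1])
quantized, token_ids = language_quantize(patches)
pseudo_caption = tokenizer.decode(token_ids[0][:16])      # first 16 "words" of image 0

In the full method, the resulting token sequence is what gets randomly masked and passed through BERT before the decoder reconstructs the image; at test time the same token sequence can stand in for the image inside a few-shot GPT-3 prompt or be fed to a linear classifier on BERT features.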
Related papers
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [17.861540412002967]
We propose a self-supervised scheme named CLIP-GEN for general text-to-image generation.
In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator.
Our method significantly outperforms optimization-based text-to-image methods in terms of image quality.
arXiv Detail & Related papers (2022-03-01T12:11:32Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to align image and text representations before fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers [46.275416873403614]
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embeddings.
Our approach achieves state-of-the-art results on downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR).
arXiv Detail & Related papers (2020-04-02T07:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.