Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
- URL: http://arxiv.org/abs/2210.03347v2
- Date: Thu, 15 Jun 2023 21:34:23 GMT
- Title: Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
- Authors: Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova
- Abstract summary: We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
- Score: 58.70423899829642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visually-situated language is ubiquitous -- sources range from textbooks with
diagrams to web pages with images and tables, to mobile apps with buttons and
forms. Perhaps due to this diversity, previous work has typically relied on
domain-specific recipes with limited sharing of the underlying data, model
architectures, and objectives. We present Pix2Struct, a pretrained
image-to-text model for purely visual language understanding, which can be
finetuned on tasks containing visually-situated language. Pix2Struct is
pretrained by learning to parse masked screenshots of web pages into simplified
HTML. The web, with its richness of visual elements cleanly reflected in the
HTML structure, provides a large source of pretraining data well suited to the
diversity of downstream tasks. Intuitively, this objective subsumes common
pretraining signals such as OCR, language modeling, and image captioning. In
addition to the novel pretraining strategy, we introduce a variable-resolution
input representation and a more flexible integration of language and vision
inputs, where language prompts such as questions are rendered directly on top
of the input image. For the first time, we show that a single pretrained model
can achieve state-of-the-art results in six out of nine tasks across four
domains: documents, illustrations, user interfaces, and natural images.
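As a concrete illustration of the prompt-on-image interface, the sketch below queries a finetuned Pix2Struct checkpoint through the Hugging Face transformers library. This is a minimal usage sketch rather than the authors' training pipeline; the checkpoint name, image path, and question are assumptions chosen for the example.
```python
# Minimal sketch: document VQA with a finetuned Pix2Struct checkpoint.
# For VQA-style checkpoints, the processor renders the question as a text
# header on top of the image, matching the prompt-on-image scheme described
# in the abstract; the image is then cut into variable-resolution patches.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

ckpt = "google/pix2struct-docvqa-base"  # assumed publicly released DocVQA checkpoint
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("document.png")        # placeholder screenshot/document image
question = "What is the invoice total?"   # placeholder question

inputs = processor(images=image, text=question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
Because the question is rendered into the pixels, the model consumes a single visual input stream rather than separate image and text channels.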
Related papers
- Autoregressive Pre-Training on Pixels and Texts [35.82610192457444]
We explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts.
Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head.
We find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of this objective appears after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
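The last related entry above describes a dual encoder aligned with a contrastive loss over noisy image/alt-text pairs. The sketch below is a minimal, hypothetical PyTorch version of that symmetric contrastive (InfoNCE-style) objective, not that paper's implementation; the embedding dimension, batch size, and temperature are illustrative assumptions, and the image/text encoders themselves are omitted.
```python
# Hypothetical sketch of a symmetric contrastive loss for a dual encoder:
# matched (i, i) image-text pairs are positives, all other in-batch pairs
# are negatives. Encoders producing the embeddings are assumed elsewhere.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: [batch, dim] embeddings from the two encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random stand-in embeddings (batch of 8, 256-dim):
img = torch.randn(8, 256)
txt = torch.randn(8, 256)
print(contrastive_loss(img, txt).item())
```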
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.