PIXAR: Auto-Regressive Language Modeling in Pixel Space
- URL: http://arxiv.org/abs/2401.03321v2
- Date: Fri, 23 Feb 2024 19:06:35 GMT
- Title: PIXAR: Auto-Regressive Language Modeling in Pixel Space
- Authors: Yintao Tai, Xiyang Liao, Alessandro Suglia, Antonio Vergari
- Abstract summary: We introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation.
Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models.
To overcome the problem that maximum-likelihood training yields noisy, hard-to-read text images, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI.
- Score: 51.530056034156374
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent work showed the possibility of building open-vocabulary large language
models (LLMs) that directly operate on pixel representations. These models are
implemented as autoencoders that reconstruct masked patches of rendered text.
However, these pixel-based LLMs are limited to discriminative tasks (e.g.,
classification) and, similar to BERT, cannot be used to generate text.
Therefore, they cannot be used for generative tasks such as free-form question
answering. In this work, we introduce PIXAR, the first pixel-based
autoregressive LLM that performs text generation. Consisting of only a decoder,
PIXAR can perform free-form generative tasks while keeping the number of
parameters on par with previous encoder-decoder models. Furthermore, we
highlight the challenges of generating text as non-noisy images and show this
is due to using a maximum likelihood objective. To overcome this problem, we
propose an adversarial pretraining stage that improves the readability and
accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI -- making it comparable to
GPT-2 on text generation tasks. This paves the way to build open-vocabulary
LLMs that operate on perceptual input only and calls into question the
necessity of the usual symbolic input representation, i.e., text as
(sub)tokens.
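The abstract describes the mechanism only at a high level. As a rough illustration, a decoder-only transformer that autoregressively predicts the next patch of rendered text might look like the minimal sketch below; all module names, dimensions, and the MSE reconstruction term are assumptions for illustration, not PIXAR's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a decoder-only pixel autoregressive LM.
# Patch size, model width, and depth are illustrative assumptions.
PATCH_DIM = 16 * 8 * 3   # flattened RGB pixels of one rendered-text patch
D_MODEL = 512

class PixelARDecoder(nn.Module):
    def __init__(self, n_layers=6, n_heads=8, max_len=256):
        super().__init__()
        self.patch_in = nn.Linear(PATCH_DIM, D_MODEL)   # embed pixel patches
        self.pos = nn.Embedding(max_len, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.patch_out = nn.Linear(D_MODEL, PATCH_DIM)  # regress the next patch

    def forward(self, patches):                         # (B, T, PATCH_DIM)
        T = patches.size(1)
        h = self.patch_in(patches) + self.pos.weight[:T]
        # standard causal mask: position t may only attend to positions <= t
        mask = torch.full((T, T), float("-inf"), device=patches.device).triu(1)
        h = self.blocks(h, mask=mask)
        return self.patch_out(h)

model = PixelARDecoder()
x = torch.rand(2, 32, PATCH_DIM)      # a batch of rendered-text patch sequences
pred = model(x[:, :-1])               # predict patch t+1 from patches <= t
mle_loss = nn.functional.mse_loss(pred, x[:, 1:])   # maximum-likelihood term
# The adversarial pretraining stage from the abstract would add a
# discriminator loss on `pred` alongside `mle_loss` to sharpen readability.
```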
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
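To make the DPTR summary above concrete, here is a minimal sketch of pretraining a recognition decoder on text alone by treating frozen text-encoder outputs as pseudo visual embeddings; the toy text encoder, shapes, and vocabulary are assumptions (DPTR itself uses the CLIP text encoder).

```python
import torch
import torch.nn as nn

# Sketch: pretrain an STR decoder on text-only data by treating a frozen
# text encoder's outputs as pseudo visual embeddings (stand-in for CLIP's).
VOCAB, D = 100, 512
text_encoder = nn.Sequential(nn.Embedding(VOCAB, D), nn.LayerNorm(D))
for p in text_encoder.parameters():
    p.requires_grad = False                      # frozen, as in DPTR

decoder_layer = nn.TransformerDecoderLayer(D, 8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
to_chars = nn.Linear(D, VOCAB)

tokens = torch.randint(0, VOCAB, (4, 12))        # text-only training batch
pseudo_visual = text_encoder(tokens)             # "image features" from text
queries = torch.zeros(4, 12, D)                  # learnable queries in practice
logits = to_chars(decoder(queries, pseudo_visual))
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens)
```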
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization model (DVA-VAE) that can adjust the encoding length based on the information density of the sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
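As background for the dynamic vector quantization idea above, the sketch below shows only the standard VQ core (nearest-codebook lookup with a straight-through gradient); DVA-VAE's distinguishing feature, adapting the code sequence length to information density, is noted in a comment since the summary gives no details. Codebook size and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-codebook quantization with a straight-through gradient."""
    def __init__(self, n_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                                  # (B, T, dim)
        # squared distance from each encoder state to every codebook entry
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(-1)                                 # nearest code per step
        q = self.codebook(idx)
        commit = nn.functional.mse_loss(z, q.detach())     # commitment loss
        q = z + (q - z).detach()                           # straight-through
        return q, idx, commit

# DVA-VAE would additionally decide how many codes to spend per segment
# based on information density; that length-adaptation logic is omitted here.
vq = VectorQuantizer()
quantized, codes, commit_loss = vq(torch.randn(2, 16, 256))
```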
- An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
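A lightweight adapter of the kind the summary above describes could be as simple as the following sketch, which maps frozen LLM hidden states into the conditioning space a text-to-image model expects; the dimensions and two-layer MLP form are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

# Sketch of a lightweight adapter: project frozen LLM hidden states into
# the embedding space the T2I model's cross-attention was trained on.
LLM_DIM, T2I_DIM = 4096, 768     # assumed widths (e.g. LLaMA-like -> CLIP-like)

adapter = nn.Sequential(
    nn.Linear(LLM_DIM, T2I_DIM),
    nn.GELU(),
    nn.Linear(T2I_DIM, T2I_DIM),
)

llm_hidden = torch.randn(1, 77, LLM_DIM)   # frozen-LLM token states for a prompt
cond = adapter(llm_hidden)                 # drop-in replacement for CLIP features
```

Only the adapter's parameters would be trained while the LLM and diffusion backbone stay frozen, which is presumably what keeps training fast.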
- Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [150.57983348059528]
PRISM is an algorithm that automatically identifies human-interpretable and transferable prompts.
It can effectively generate desired concepts given only black-box access to T2I models.
Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
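The summary above gives no algorithmic detail, but a black-box prompt search in the spirit of PRISM can be schematized as below; `propose`, `generate_image`, and `score` are hypothetical stand-ins for an LLM prompt writer, the black-box T2I model, and an image-similarity judge.

```python
# Schematic of a black-box prompt-search loop: propose candidate prompts,
# score the T2I outputs against a reference, and keep the best candidate.
def search_prompt(reference, propose, generate_image, score, n_iters=10):
    best_prompt, best_score = None, float("-inf")
    feedback = None
    for _ in range(n_iters):
        candidate = propose(reference, feedback)   # human-readable prompt
        image = generate_image(candidate)          # black-box T2I access only
        s = score(image, reference)
        if s > best_score:
            best_prompt, best_score = candidate, s
        feedback = (candidate, s)                  # refine the next proposal
    return best_prompt
```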
- Text Rendering Strategies for Pixel Language Models [21.36370101063954]
In this paper, we investigate four approaches to rendering text in the PIXEL model.
We find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks.
Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias.
arXiv Detail & Related papers (2023-11-01T13:49:31Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
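For intuition, the character bigram rendering described above can be sketched with Pillow: each fixed-size patch displays exactly two characters rather than slicing continuously rendered text. Patch geometry and font handling are simplified assumptions.

```python
from PIL import Image, ImageDraw

# Sketch of character-bigram rendering: two characters per fixed-size patch.
PATCH_W, PATCH_H = 16, 16

def render_bigrams(text):
    bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    img = Image.new("L", (PATCH_W * len(bigrams), PATCH_H), color=255)
    draw = ImageDraw.Draw(img)
    for i, bg in enumerate(bigrams):
        draw.text((i * PATCH_W + 2, 2), bg, fill=0)   # default bitmap font
    return img   # downstream, the model consumes PATCH_W x PATCH_H slices

patches_img = render_bigrams("language modelling with pixels")
```

Because every patch now shows a whole bigram, identical patches recur across the corpus, which is consistent with the patch frequency bias the analysis above reports.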
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
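LQAE's central move, per the summary above, can be sketched as quantizing image features to the nearest rows of a pretrained LM's token-embedding table, so an image becomes a sequence of text tokens; here the table is random and the shapes are assumptions (LQAE uses a real LM's embeddings, e.g. BERT's).

```python
import torch
import torch.nn as nn

VOCAB, D = 30522, 768                    # BERT-like vocabulary and width
lm_embeddings = nn.Embedding(VOCAB, D)   # stand-in for a frozen LM's table
image_feats = torch.randn(1, 196, D)     # e.g. ViT patch features of one image

# distance from every patch feature to every token embedding
d = torch.cdist(image_feats, lm_embeddings.weight.unsqueeze(0))
token_ids = d.argmin(-1)                 # (1, 196): the image as "text" tokens
quantized = lm_embeddings(token_ids)     # a decoder reconstructs from these
```

Because the image is now literally a token sequence, it can be placed in an LLM prompt for few-shot classification, which is the use the summary mentions.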
- Language Modelling with Pixels [29.976453396194053]
This paper introduces PIXEL, the Pixel-based Encoder of Language, which sidesteps the vocabulary bottleneck of subword-based models.
PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages.
We evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts.
arXiv Detail & Related papers (2022-07-14T15:20:36Z)
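PIXEL, as summarized above, is pretrained by reconstructing masked patches of rendered text (the PIXAR abstract likewise describes such models as autoencoders over masked patches). A minimal sketch, with a trivial stand-in autoencoder and an assumed masking ratio:

```python
import torch
import torch.nn as nn

# Sketch of masked-patch pretraining over rendered text: blank out a random
# subset of patches and train the model to reconstruct their pixels.
P = 16 * 16 * 3                              # flattened pixels per patch
autoencoder = nn.Sequential(nn.Linear(P, 256), nn.GELU(), nn.Linear(256, P))

patches = torch.rand(8, 529, P)              # one rendered-text image per row
mask = torch.rand(8, 529) < 0.25             # mask ~25% of the patches (assumed)
inputs = patches.masked_fill(mask.unsqueeze(-1), 0.0)

recon = autoencoder(inputs)
loss = (recon - patches)[mask].pow(2).mean() # reconstruct masked patches only
```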
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new cross-modal generative pre-training method for image captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.