Text Rendering Strategies for Pixel Language Models
- URL: http://arxiv.org/abs/2311.00522v1
- Date: Wed, 1 Nov 2023 13:49:31 GMT
- Title: Text Rendering Strategies for Pixel Language Models
- Authors: Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott
- Abstract summary: In this paper, we investigate four approaches to rendering text in the PIXEL model.
We find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks.
Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias.
- Score: 21.36370101063954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pixel-based language models process text rendered as images, which allows
them to handle any script, making them a promising approach to open vocabulary
language modelling. However, recent approaches use text renderers that produce
a large set of almost-equivalent input patches, which may prove sub-optimal for
downstream tasks, due to redundancy in the input representations. In this
paper, we investigate four approaches to rendering text in the PIXEL model
(Rust et al., 2023), and find that simple character bigram rendering brings
improved performance on sentence-level tasks without compromising performance
on token-level or multilingual tasks. This new rendering strategy also makes it
possible to train a more compact model with only 22M parameters that performs
on par with the original 86M parameter model. Our analyses show that character
bigram rendering leads to a consistently better model but with an anisotropic
patch embedding space, driven by a patch frequency bias, highlighting the
connections between image patch- and tokenization-based language models.
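To make the contrast concrete, below is a minimal sketch (not the paper's actual renderer) of how continuous rendering and character-bigram rendering differ when text is drawn onto a strip of fixed-size patches. The Pillow-based approach, the 16-pixel patch size, and the font file are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch (not the paper's renderer): contrasting "continuous" rendering,
# where patch boundaries fall at arbitrary pixel offsets, with character-bigram
# rendering, where every patch holds exactly two characters. Patch size, font file,
# and the use of Pillow are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

PATCH = 16  # assumed square patch size in pixels

def render_continuous(text: str, font: ImageFont.FreeTypeFont) -> Image.Image:
    """Draw the whole string as one strip; cutting it into 16-pixel patches later
    yields patches whose contents depend on where each word happens to start."""
    width = int(font.getlength(text)) + PATCH
    img = Image.new("L", (width, PATCH), color=255)
    ImageDraw.Draw(img).text((0, 2), text, font=font, fill=0)
    return img

def render_bigrams(text: str, font: ImageFont.FreeTypeFont) -> Image.Image:
    """Draw exactly two characters into each patch, so patch boundaries always
    coincide with character-pair boundaries and far fewer distinct patches occur."""
    bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    img = Image.new("L", (PATCH * len(bigrams), PATCH), color=255)
    draw = ImageDraw.Draw(img)
    for k, bigram in enumerate(bigrams):
        draw.text((k * PATCH, 2), bigram, font=font, fill=0)
    return img

if __name__ == "__main__":
    font = ImageFont.truetype("DejaVuSansMono.ttf", 12)  # any monospaced font works
    strip = render_bigrams("open vocabulary modelling", font)
    patches = [strip.crop((k * PATCH, 0, (k + 1) * PATCH, PATCH))
               for k in range(strip.width // PATCH)]
    print(f"{len(patches)} bigram patches of size {PATCH}x{PATCH}")
```

Under the bigram scheme every patch boundary coincides with a character-pair boundary, so the encoder sees a much smaller set of distinct patches than with continuous rendering, which is the redundancy argument made in the abstract.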
Related papers
- An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation [43.139415423751615]
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment.
A pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task.
We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model.
arXiv Detail & Related papers (2024-08-16T10:33:19Z)
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.