Text Rendering Strategies for Pixel Language Models
- URL: http://arxiv.org/abs/2311.00522v1
- Date: Wed, 1 Nov 2023 13:49:31 GMT
- Title: Text Rendering Strategies for Pixel Language Models
- Authors: Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott
- Abstract summary: In this paper, we investigate four approaches to rendering text in the PIXEL model.
We find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks.
Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias.
- Score: 21.36370101063954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pixel-based language models process text rendered as images, which allows
them to handle any script, making them a promising approach to open vocabulary
language modelling. However, recent approaches use text renderers that produce
a large set of almost-equivalent input patches, which may prove sub-optimal for
downstream tasks, due to redundancy in the input representations. In this
paper, we investigate four approaches to rendering text in the PIXEL model
(Rust et al., 2023), and find that simple character bigram rendering brings
improved performance on sentence-level tasks without compromising performance
on token-level or multilingual tasks. This new rendering strategy also makes it
possible to train a more compact model with only 22M parameters that performs
on par with the original 86M parameter model. Our analyses show that character
bigram rendering leads to a consistently better model but with an anisotropic
patch embedding space, driven by a patch frequency bias, highlighting the
connections between image patch- and tokenization-based language models.
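To make the contrast concrete, below is a minimal sketch (not the paper's actual renderer) of how continuous rendering and character-bigram rendering differ when text is drawn onto a strip of fixed-size patches. The Pillow-based approach, the 16-pixel patch size, and the font file are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch (not the paper's renderer): contrasting "continuous" rendering,
# where patch boundaries fall at arbitrary pixel offsets, with character-bigram
# rendering, where every patch holds exactly two characters. Patch size, font file,
# and the use of Pillow are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

PATCH = 16  # assumed square patch size in pixels

def render_continuous(text: str, font: ImageFont.FreeTypeFont) -> Image.Image:
    """Draw the whole string as one strip; cutting it into 16-pixel patches later
    yields patches whose contents depend on where each word happens to start."""
    width = int(font.getlength(text)) + PATCH
    img = Image.new("L", (width, PATCH), color=255)
    ImageDraw.Draw(img).text((0, 2), text, font=font, fill=0)
    return img

def render_bigrams(text: str, font: ImageFont.FreeTypeFont) -> Image.Image:
    """Draw exactly two characters into each patch, so patch boundaries always
    coincide with character-pair boundaries and far fewer distinct patches occur."""
    bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    img = Image.new("L", (PATCH * len(bigrams), PATCH), color=255)
    draw = ImageDraw.Draw(img)
    for k, bigram in enumerate(bigrams):
        draw.text((k * PATCH, 2), bigram, font=font, fill=0)
    return img

if __name__ == "__main__":
    font = ImageFont.truetype("DejaVuSansMono.ttf", 12)  # any monospaced font works
    strip = render_bigrams("open vocabulary modelling", font)
    patches = [strip.crop((k * PATCH, 0, (k + 1) * PATCH, PATCH))
               for k in range(strip.width // PATCH)]
    print(f"{len(patches)} bigram patches of size {PATCH}x{PATCH}")
```

Under the bigram scheme every patch boundary coincides with a character-pair boundary, so the encoder sees a much smaller set of distinct patches than with continuous rendering, which is the redundancy argument made in the abstract.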
Related papers
- An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation [43.139415423751615]
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment.
A pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task.
We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model.
arXiv Detail & Related papers (2024-08-16T10:33:19Z)
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.