Language Modelling with Pixels
- URL: http://arxiv.org/abs/2207.06991v2
- Date: Wed, 26 Apr 2023 15:27:35 GMT
- Title: Language Modelling with Pixels
- Authors: Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky,
Miryam de Lhoneux, Desmond Elliott
- Abstract summary: This paper introduces PIXEL, the Pixel-based Encoder of Language, which sidesteps the vocabulary bottleneck of subword-based language models.
PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages.
We evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts.
- Score: 29.976453396194053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models are defined over a finite set of inputs, which creates a
vocabulary bottleneck when we attempt to scale the number of supported
languages. Tackling this bottleneck results in a trade-off between what can be
represented in the embedding matrix and computational issues in the output
layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which
suffers from neither of these issues. PIXEL is a pretrained language model that
renders text as images, making it possible to transfer representations across
languages based on orthographic similarity or the co-activation of pixels.
PIXEL is trained to reconstruct the pixels of masked patches instead of
predicting a distribution over tokens. We pretrain the 86M parameter PIXEL
model on the same English data as BERT and evaluate on syntactic and semantic
tasks in typologically diverse languages, including various non-Latin scripts.
We find that PIXEL substantially outperforms BERT on syntactic and semantic
processing tasks on scripts that are not found in the pretraining data, but
PIXEL is slightly weaker than BERT when working with Latin scripts.
Furthermore, we find that PIXEL is more robust than BERT to orthographic
attacks and linguistic code-switching, further confirming the benefits of
modelling language with pixels.
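To make the pretraining objective concrete, here is a minimal sketch of rendering text as an image and reconstructing the pixels of masked patches. The tiny encoder/decoder and the simple every-fourth-patch masking are illustrative stand-ins, not the actual 86M-parameter ViT-MAE architecture or the span-masking scheme PIXEL uses.

```python
# Minimal sketch of PIXEL's pretraining idea: render text as pixels,
# mask patches, and train to reconstruct raw pixel values (MSE loss)
# rather than predict a distribution over tokens. The tiny modules
# below are stand-ins, not the real ViT-MAE architecture.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw

PATCH = 16  # PIXEL renders text into a strip of 16x16 pixel patches

def render_text(text: str, width: int = 256) -> torch.Tensor:
    """Render a string onto a grayscale strip, values in [0, 1]."""
    img = Image.new("L", (width, PATCH), color=255)
    ImageDraw.Draw(img).text((2, 2), text, fill=0)  # PIL's default font
    return torch.from_numpy(np.array(img, dtype=np.float32) / 255.0)

def to_patches(img: torch.Tensor) -> torch.Tensor:
    """Split an (H, W) strip into flattened (num_patches, PATCH*PATCH) rows."""
    h, w = img.shape
    return (img.view(h // PATCH, PATCH, w // PATCH, PATCH)
               .permute(0, 2, 1, 3).reshape(-1, PATCH * PATCH))

class TinyMAE(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(PATCH * PATCH, dim), nn.GELU())
        self.decode = nn.Linear(dim, PATCH * PATCH)  # regress raw pixels

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        visible = patches.clone()
        visible[mask] = 0.0  # blank out the masked patches
        recon = self.decode(self.encode(visible))
        # reconstruction loss on masked patches only; no token softmax
        return ((recon[mask] - patches[mask]) ** 2).mean()

patches = to_patches(render_text("language modelling with pixels"))
mask = torch.zeros(patches.shape[0], dtype=torch.bool)
mask[::4] = True  # mask 25% of patches (PIXEL actually masks spans)
loss = TinyMAE()(patches, mask)
loss.backward()
```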
Related papers
- Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models [7.356870418870544]
Pixel-based language models have emerged as a compelling alternative to subword-based language modelling.
PIXEL is a vision transformer that has been pre-trained on rendered text.
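Probing studies of this kind typically freeze the pretrained encoder and train only a small classifier on top of its hidden states. The sketch below shows that generic setup; the hidden size, the 17-class tag set, and the random features are placeholders, not Pixology's exact protocol.

```python
# Generic linear-probe setup used to analyse frozen encoders; all
# shapes are illustrative assumptions.
import torch
import torch.nn as nn

hidden_dim, num_labels = 768, 17  # e.g. 17 UPOS tags; illustrative

probe = nn.Linear(hidden_dim, num_labels)  # the ONLY trainable part

# In practice these states come from a frozen PIXEL encoder's patch
# representations; here a random batch stands in for them.
frozen_states = torch.randn(32, hidden_dim)
labels = torch.randint(0, num_labels, (32,))

loss = nn.functional.cross_entropy(probe(frozen_states), labels)
loss.backward()  # updates reach the probe only; the encoder stays fixed
```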
arXiv Detail & Related papers (2024-10-15T19:21:23Z) - T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We also present PHOENIX-News, a new large German sign language dataset containing 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z) - PIXAR: Auto-Regressive Language Modeling in Pixel Space [51.530056034156374]
We introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation.
Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models.
To overcome the challenge of generating readable text directly in pixel space, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 points on LAMBADA and 8.5 on bAbI.
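A decoder-only pixel LM of this kind regresses the pixels of the next patch from the patches before it. The toy loop below sketches that autoregressive setup under assumed shapes; it is not PIXAR's architecture and omits the adversarial pretraining stage entirely.

```python
# Toy sketch of autoregressive generation over pixel patches: a causal
# Transformer regresses the next patch's pixel values from the previous
# patches. Shapes and layers are illustrative, not PIXAR's.
import torch
import torch.nn as nn

PATCH_DIM, D = 256, 128  # flattened 16x16 patch, toy model width

class PatchDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.proj_in = nn.Linear(PATCH_DIM, D)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(D, PATCH_DIM)  # regress next patch's pixels

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        n = patches.shape[1]
        # causal mask: each position attends only to earlier patches
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.proj_in(patches), mask=causal)
        return self.proj_out(h)

model = PatchDecoder()
strip = torch.rand(1, 4, PATCH_DIM)  # 4 rendered "prompt" patches
with torch.no_grad():
    for _ in range(8):  # greedily extend the strip patch by patch
        nxt = model(strip)[:, -1:, :]
        strip = torch.cat([strip, nxt], dim=1)
```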
arXiv Detail & Related papers (2024-01-06T22:49:38Z) - Text Rendering Strategies for Pixel Language Models [21.36370101063954]
In this paper, we investigate four approaches to rendering text in the PIXEL model.
We find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks.
Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias.
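The bigram strategy fixes the text-to-patch alignment so that every patch holds exactly two characters, meaning identical bigrams always render to identical patches. A rough sketch with PIL, using its default bitmap font rather than the paper's rendering backend:

```python
# Sketch of character-bigram rendering: one 16x16 patch per pair of
# characters, so the same bigram always yields the same patch.
from PIL import Image, ImageDraw

PATCH = 16

def render_bigrams(text: str) -> Image.Image:
    bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]
    strip = Image.new("L", (PATCH * len(bigrams), PATCH), color=255)
    draw = ImageDraw.Draw(strip)
    for i, bg in enumerate(bigrams):
        draw.text((i * PATCH + 1, 2), bg, fill=0)  # one bigram per patch
    return strip

render_bigrams("pixel language models").save("bigram_strip.png")
```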
arXiv Detail & Related papers (2023-11-01T13:49:31Z) - Multilingual Pixel Representations for Translation and Effective
Cross-lingual Transfer [25.575718310334643]
We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations.
We explore various properties of pixel representations such as parameter sharing within and across scripts to better understand where they lead to positive transfer.
We observe that these properties not only enable seamless cross-lingual transfer to unseen scripts, but make pixel representations more data-efficient than alternatives such as vocabulary expansion.
arXiv Detail & Related papers (2023-05-23T17:26:50Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
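The core mechanism can be sketched as nearest-neighbour quantization against a frozen token-embedding matrix, with a straight-through estimator so the image encoder still receives gradients. The random matrices below stand in for a real image encoder and BERT's embedding table; this is a sketch of the idea, not LQAE's implementation.

```python
# Sketch of the LQAE idea: snap continuous image features to the
# nearest rows of a frozen LM token-embedding matrix, yielding a
# "text" code for each image region. All tensors are stand-ins.
import torch

vocab_size, dim = 30522, 768                  # BERT-base-like shapes
lm_embeddings = torch.randn(vocab_size, dim)  # frozen codebook (stand-in)

def quantize(img_feats: torch.Tensor):
    """Map each feature vector to its nearest token embedding."""
    dists = torch.cdist(img_feats, lm_embeddings)  # (n, vocab_size)
    ids = dists.argmin(dim=-1)                     # pseudo text tokens
    quantized = lm_embeddings[ids]
    # straight-through estimator: copy gradients past the argmin
    return ids, img_feats + (quantized - img_feats).detach()

img_feats = torch.randn(49, dim, requires_grad=True)  # e.g. a 7x7 grid
token_ids, quantized = quantize(img_feats)
quantized.sum().backward()  # gradients reach the image features
```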
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embeddings.
XLP projects the word embeddings into a language-specific semantic space, and the projected embeddings are then fed into the Transformer model.
Experiments show that XLP can significantly boost model performance on extensive multilingual benchmark datasets.
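The mechanism is simple to state: rather than adding a learned language embedding to each token, multiply the shared word embeddings by a per-language projection matrix before the Transformer. A minimal sketch with assumed dimensions, not the paper's configuration:

```python
# Sketch of a language-specific projection in place of a language
# embedding: each language gets its own matrix applied to the shared
# word embeddings. Sizes are illustrative only.
import torch
import torch.nn as nn

vocab, dim, num_langs = 30000, 256, 8

word_emb = nn.Embedding(vocab, dim)  # shared across all languages
# one projection matrix per language, initialised to the identity
lang_proj = nn.Parameter(torch.eye(dim).repeat(num_langs, 1, 1))

def embed(token_ids: torch.Tensor, lang_id: int) -> torch.Tensor:
    """Project shared word embeddings into a language-specific space."""
    return word_emb(token_ids) @ lang_proj[lang_id]

tokens = torch.randint(0, vocab, (2, 16))  # (batch, seq_len)
hidden = embed(tokens, lang_id=3)          # this is what the Transformer sees
```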
arXiv Detail & Related papers (2021-02-16T18:47:10Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
Transformers [46.275416873403614]
We propose Pixel-BERT, which aligns image pixels with text via deep multi-modal transformers that jointly learn visual and language embeddings.
Our approach achieves state-of-the-art results on downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR).
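The joint architecture can be sketched as one Transformer attending over a concatenation of text-token embeddings and pixel features from a visual backbone. The toy convolution below stands in for the paper's CNN backbone, and all sizes are assumptions.

```python
# Toy sketch of a joint pixel-text Transformer in the spirit of
# Pixel-BERT: CNN pixel features and token embeddings share one
# attention stack. The tiny conv is a stand-in for a real backbone.
import torch
import torch.nn as nn

D = 256
text_emb = nn.Embedding(1000, D)
pixel_cnn = nn.Conv2d(3, D, kernel_size=16, stride=16)  # toy feature extractor
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
joint = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, 1000, (1, 12))  # (batch, text_len)
image = torch.rand(1, 3, 64, 64)          # (batch, C, H, W)
pix = pixel_cnn(image).flatten(2).transpose(1, 2)         # (batch, 16, D)
fused = joint(torch.cat([text_emb(tokens), pix], dim=1))  # cross-modal attention
```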
arXiv Detail & Related papers (2020-04-02T07:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.