PaLI: A Jointly-Scaled Multilingual Language-Image Model
- URL: http://arxiv.org/abs/2209.06794v4
- Date: Mon, 5 Jun 2023 17:55:12 GMT
- Title: PaLI: A Jointly-Scaled Multilingual Language-Image Model
- Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr
Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas
Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan
Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury,
Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos
Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu
Soricut
- Abstract summary: PaLI (Pathways Language and Image model) extends the effective-scaling and flexible-task-interface approach of large language models to the joint modeling of language and vision.
We create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages.
- Score: 110.10710554358455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective scaling and a flexible task interface enable large language models
to excel at many tasks. We present PaLI (Pathways Language and Image model), a
model that extends this approach to the joint modeling of language and vision.
PaLI generates text based on visual and textual inputs, and with this interface
performs many vision, language, and multimodal tasks, in many languages. To
train PaLI, we make use of large pre-trained encoder-decoder language models
and Vision Transformers (ViTs). This allows us to capitalize on their existing
capabilities and leverage the substantial cost of training them. We find that
joint scaling of the vision and language components is important. Since
existing Transformers for language are much larger than their vision
counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the
benefits from even larger-capacity vision models. To train PaLI, we create a
large multilingual mix of pretraining tasks, based on a new image-text training
set containing 10B images and texts in over 100 languages. PaLI achieves
state-of-the-art in multiple vision and language tasks (such as captioning,
visual question-answering, scene-text understanding), while retaining a simple,
modular, and scalable design.
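The abstract describes the interface only at a high level. As a rough illustration (not the paper's implementation), the sketch below shows the general shape of such an architecture: a ViT turns the image into a sequence of visual tokens, these are concatenated with the embedded input text, and an encoder-decoder Transformer generates the output text. All class names, layer counts, and dimensions here are illustrative assumptions; PaLI itself reuses large pre-trained ViT and encoder-decoder language-model checkpoints.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Patchify an image and contextualize the patches with a Transformer encoder."""

    def __init__(self, image_size=224, patch=16, dim=256, depth=2):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (image_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                                # (B, 3, H, W)
        x = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(x + self.pos)                     # visual tokens


class TinyPaLI(nn.Module):
    """Generate text conditioned on [visual tokens ; embedded input-text tokens]."""

    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.vit = TinyViT(dim=dim)
        self.embed = nn.Embedding(vocab, dim)
        self.seq2seq = nn.Transformer(d_model=dim, nhead=4,
                                      num_encoder_layers=2, num_decoder_layers=2,
                                      batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, input_ids, decoder_ids):
        vis = self.vit(images)                      # (B, N, dim) visual tokens
        txt = self.embed(input_ids)                 # (B, T_in, dim) prompt tokens
        src = torch.cat([vis, txt], dim=1)          # multimodal encoder input
        tgt = self.embed(decoder_ids)               # (B, T_out, dim)
        causal = self.seq2seq.generate_square_subsequent_mask(decoder_ids.size(1))
        out = self.seq2seq(src, tgt, tgt_mask=causal)
        return self.lm_head(out)                    # next-token logits


model = TinyPaLI()
logits = model(torch.randn(2, 3, 224, 224),         # image
               torch.randint(0, 32000, (2, 12)),    # tokenized task prompt
               torch.randint(0, 32000, (2, 8)))     # shifted target text
print(logits.shape)                                 # torch.Size([2, 8, 32000])
```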
Related papers
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS), which combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models, without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation; a schematic sketch of this reweighting idea appears after this list.
arXiv Detail & Related papers (2023-10-15T07:58:52Z)
- Vision Language Transformers: A Survey [0.9137554315375919]
Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform.
Recent research has adapted the pretrained transformer architecture introduced by Vaswani et al. (2017) to vision language modeling.
Transformer models have greatly improved performance and versatility over previous vision language models.
arXiv Detail & Related papers (2023-07-06T19:08:56Z)
- PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations [6.405005247717135]
We propose the Vision-and-Augmented-Language Transformer (VAuLT), an extension of the popular Vision-and-Language Transformer (ViLT) that improves performance on vision-and-language tasks.
We show that such a strategy significantly improves over ViLT on vision-and-language tasks involving richer language inputs.
arXiv Detail & Related papers (2022-08-18T18:51:13Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
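As referenced in the VLIS entry above, that paper combines a text-only language model with a vision-language model through importance-sampling weights, without further training. The snippet below is a hedged, illustrative sketch of one way such per-token reweighting could look at decoding time; the function name, its inputs, and the `alpha` knob are hypothetical placeholders rather than the paper's exact formulation or any library API.

```python
import torch


def vlis_style_step(text_lm_logprobs: torch.Tensor,
                    vlm_logprobs_with_image: torch.Tensor,
                    vlm_logprobs_text_only: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """One decoding step; all inputs are (vocab_size,) next-token log-probabilities.

    The text-only LM supplies language understanding; the vision-language model
    contributes an importance-sampling-style weight measuring how much the image
    raises each candidate token's probability.
    """
    image_evidence = vlm_logprobs_with_image - vlm_logprobs_text_only  # log importance weight
    combined = text_lm_logprobs + alpha * image_evidence
    return torch.log_softmax(combined, dim=-1)  # renormalized distribution to sample from
```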