PaLI-X: On Scaling up a Multilingual Vision and Language Model
- URL: http://arxiv.org/abs/2305.18565v1
- Date: Mon, 29 May 2023 18:58:38 GMT
- Title: PaLI-X: On Scaling up a Multilingual Vision and Language Model
- Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit
Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi
Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael
Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee
Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias
Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas
Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel
Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov,
Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu
Soricut
- Abstract summary: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
- Score: 166.9837904115951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the training recipe and results of scaling up PaLI-X, a
multilingual vision and language model, both in terms of size of the components
and the breadth of its training task mixture. Our model achieves new levels of
performance on a wide range of varied and complex tasks, including multiple
image-based captioning and question-answering tasks, image-based document
understanding and few-shot (in-context) learning, as well as object detection,
video question answering, and video captioning. PaLI-X advances the
state-of-the-art on most vision-and-language benchmarks considered (25+ of
them). Finally, we observe emerging capabilities, such as complex counting and
multilingual object detection, tasks that are not explicitly in the training
mix.
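The abstract describes the PaLI recipe at a high level: an image encoder produces visual tokens that a text encoder-decoder consumes together with the text prompt. The sketch below is a minimal illustration of that shape of model, assuming a simple ViT-style patch encoder and a generic PyTorch encoder-decoder; the layer counts, dimensions, and module choices are illustrative assumptions, not PaLI-X's actual components, sizes, or training mixture.

```python
# Minimal, illustrative sketch of a PaLI-style model (not PaLI-X's implementation):
# a ViT-like image encoder turns an image into a sequence of visual tokens, which
# are concatenated with text tokens and fed to a text encoder-decoder.
import torch
import torch.nn as nn


class ToyPaLIStyleModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, patch=16, img_size=224):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # "ViT" stand-in: patchify + linear projection + a small encoder.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.img_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Text encoder-decoder stand-in for the language backbone.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.seq2seq = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, prompt_ids, target_ids):
        # Visual tokens from the image encoder.
        vis = self.patch_embed(image).flatten(2).transpose(1, 2) + self.img_pos
        vis = self.img_encoder(vis)
        # Concatenate visual tokens with embedded prompt tokens ("image as tokens").
        enc_in = torch.cat([vis, self.tok_embed(prompt_ids)], dim=1)
        dec_in = self.tok_embed(target_ids)
        hidden = self.seq2seq(enc_in, dec_in)
        return self.lm_head(hidden)  # next-token logits over the vocabulary


model = ToyPaLIStyleModel()
logits = model(
    torch.randn(1, 3, 224, 224),       # one RGB image
    torch.randint(0, 32000, (1, 8)),   # tokenized prompt, e.g. a VQA question
    torch.randint(0, 32000, (1, 4)),   # decoder input (shifted target answer/caption)
)
print(logits.shape)  # torch.Size([1, 4, 32000])
```

The same interface covers the captioning, VQA, and detection-as-text tasks the abstract lists, since every task reduces to generating text conditioned on visual tokens plus a prompt.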
Related papers
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
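As a rough illustration of the VLIS idea summarized above (a vision-language model's visual conditioning combined with a text-only model's language understanding at decoding time, with no further training), here is a minimal sketch. The additive scoring rule, the alpha weight, and the toy model interfaces are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: reweight a text-only LM with visual evidence from a VLM at
# decoding time. Interfaces and the log-space combination are illustrative.

def vlis_style_score(tok, ctx, image, vlm_logprob, text_logprob, alpha=1.0):
    """Score a candidate next token.

    vlm_logprob(tok, ctx, image) -> float   (image may be None)
    text_logprob(tok, ctx)       -> float
    Both callables are assumed interfaces for this sketch.
    """
    # Fluency / language understanding from the unimodal text-only model.
    text_score = text_logprob(tok, ctx)
    # Visual conditioning: how much the image raises the VLM's belief in the
    # token (a pointwise-mutual-information-like importance weight).
    visual_weight = vlm_logprob(tok, ctx, image) - vlm_logprob(tok, ctx, None)
    return text_score + alpha * visual_weight


def greedy_next_token(vocab, ctx, image, vlm_logprob, text_logprob):
    # Pick the highest-scoring next token under the combined score.
    return max(vocab, key=lambda t: vlis_style_score(t, ctx, image, vlm_logprob, text_logprob))


# Tiny toy models so the sketch runs end to end.
def toy_text_logprob(tok, ctx):
    return {"cat": -1.0, "dog": -1.2, "car": -3.0}.get(tok, -5.0)

def toy_vlm_logprob(tok, ctx, image):
    seen = {"dog": -0.5} if image is not None else {}  # the "image" shows a dog
    return seen.get(tok, -2.0)

print(greedy_next_token(["cat", "dog", "car"], "A photo of a", "<img>",
                        toy_vlm_logprob, toy_text_logprob))
# -> "dog": the text LM alone slightly prefers "cat", but the visual weight flips it.
```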
arXiv Detail & Related papers (2023-10-15T07:58:52Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
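The summary above mentions unique identifiers per task during multi-task training. Below is a small, hedged illustration of what such task-tagged prompts could look like; the identifier tokens and the prompt template are made-up examples, not the format used by the paper.

```python
# Illustrative only: hypothetical task-identifier tags prepended to prompts so
# one model can distinguish vision-language tasks during multi-task training.
TASK_TAGS = {
    "vqa": "[vqa]",
    "caption": "[caption]",
    "grounding": "[grounding]",
}

def build_prompt(task, instruction, image_placeholder="<Img><ImageHere></Img>"):
    # One shared template; the leading tag tells the model which task it is doing.
    return f"{image_placeholder} {TASK_TAGS[task]} {instruction}"

print(build_prompt("vqa", "What color is the car?"))
print(build_prompt("grounding", "Where is the dog in the image?"))
```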
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning [27.544311403607786]
We introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs).
Our models adopt the Querying Transformer from BLIP-2 and further explore the benefits of additional optimization schemes.
In addition, we leverage the understanding ability of GPT-4 in multi-modal scenarios by translating our collected English image-text datasets into Chinese.
arXiv Detail & Related papers (2023-10-12T09:39:17Z)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model [110.10710554358455]
PaLI (Pathways Language and Image model) extends the scaling and flexible task-interface approach of large language models to the joint modeling of language and vision.
We create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages.
arXiv Detail & Related papers (2022-09-14T17:24:07Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce BEiT-3, a general-purpose multimodal foundation model.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
- All-in-One Image-Grounded Conversational Agents [31.28974522911758]
We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module.
We provide a thorough analysis of the components of the model, and transfer performance when training on one, some, or all of the tasks.
arXiv Detail & Related papers (2019-12-28T03:51:52Z)