PaLI-3 Vision Language Models: Smaller, Faster, Stronger
- URL: http://arxiv.org/abs/2310.09199v2
- Date: Tue, 17 Oct 2023 22:38:51 GMT
- Title: PaLI-3 Vision Language Models: Smaller, Faster, Stronger
- Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul
Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr
Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong,
Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
- Abstract summary: PaLI-3 is a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks.
- Score: 82.6453282241224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents PaLI-3, a smaller, faster, and stronger vision language
model (VLM) that compares favorably to similar models that are 10x larger. As
part of arriving at this strong performance, we compare Vision Transformer
(ViT) models pretrained using classification objectives to contrastively
(SigLIP) pretrained ones. We find that, while slightly underperforming on
standard image classification benchmarks, SigLIP-based PaLI shows superior
performance across various multimodal benchmarks, especially on localization
and visually-situated text understanding. We scale the SigLIP image encoder up
to 2 billion parameters, achieving a new state-of-the-art on multilingual
cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles
research on fundamental pieces of complex VLMs, and could fuel a new generation
of scaled-up models.
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter.
Our method achieves these advancements with 30-70% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- Training Vision-Language Transformers from Captions [80.00302205584335]
We introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders.
In a head-to-head comparison between ViLT and our model, we find that our approach outperforms ViLT on standard benchmarks.
arXiv Detail & Related papers (2022-05-19T00:19:48Z)