Flamingo: a Visual Language Model for Few-Shot Learning
- URL: http://arxiv.org/abs/2204.14198v1
- Date: Fri, 29 Apr 2022 16:29:01 GMT
- Title: Flamingo: a Visual Language Model for Few-Shot Learning
- Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain
Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm
Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong,
Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew
Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo
Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
- Abstract summary: We introduce Flamingo, a family of Visual Language Models (VLM) with this ability.
Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora.
We demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning.
- Score: 95.88782798074314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building models that can be rapidly adapted to numerous tasks using only a
handful of annotated examples is an open challenge for multimodal machine
learning research. We introduce Flamingo, a family of Visual Language Models
(VLM) with this ability. Flamingo models include key architectural innovations
to: (i) bridge powerful pretrained vision-only and language-only models, (ii)
handle sequences of arbitrarily interleaved visual and textual data, and (iii)
seamlessly ingest images or videos as inputs. Thanks to their flexibility,
Flamingo models can be trained on large-scale multimodal web corpora containing
arbitrarily interleaved text and images, which is key to endow them with
in-context few-shot learning capabilities. We perform a thorough evaluation of
the proposed Flamingo models, exploring and measuring their ability to rapidly
adapt to a variety of image and video understanding benchmarks. These include
open-ended tasks such as visual question-answering, where the model is prompted
with a question which it has to answer, captioning tasks, which evaluate the
ability to describe a scene or an event, and close-ended tasks such as multiple
choice visual question-answering. For tasks lying anywhere on this spectrum, we
demonstrate that a single Flamingo model can achieve a new state of the art for
few-shot learning, simply by prompting the model with task-specific examples.
On many of these benchmarks, Flamingo actually surpasses the performance of
models that are fine-tuned on thousands of times more task-specific data.
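The abstract's claim that adaptation happens "simply by prompting the model with task-specific examples" can be pictured as building a single interleaved image/text prompt. The sketch below is an illustration only: the `FewShotVLM` class, its `generate` method, and the dummy images are assumptions for the sake of a runnable example, since no code-level API for Flamingo is given here.

```python
from PIL import Image

class FewShotVLM:
    """Hypothetical stand-in for a Flamingo-style model (not a real API)."""
    def generate(self, prompt, max_new_tokens=32):
        # A real model would attend from the text tokens to visual features of
        # the interleaved images; here we only return a placeholder string.
        return "<generated caption>"

# Dummy images so the sketch runs without any files on disk.
support_1 = Image.new("RGB", (224, 224))
support_2 = Image.new("RGB", (224, 224))
query     = Image.new("RGB", (224, 224))

# Task-specific examples are supplied purely through the prompt: the sequence
# interleaves images and text, and no weights are updated at adaptation time.
prompt = [
    support_1, "Caption: A dog catching a frisbee.",
    support_2, "Caption: A bowl of ramen on a table.",
    query,     "Caption:",
]

model = FewShotVLM()
print(model.generate(prompt))
```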
Related papers
- EVLM: An Efficient Vision-Language Model for Visual Understanding [18.794601813330715]
This paper proposes an efficient multi-modal language model to minimize computational costs.
Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
arXiv Detail & Related papers (2024-07-19T10:09:51Z) - MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks within a single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z) - Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z) - MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling of various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z) - In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z) - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z) - Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot
Image Captioning [153.98100182439165]
We introduce Re-ViLM, a Retrieval-augmented Visual Language Model built upon Flamingo.
By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters.
We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks.
arXiv Detail & Related papers (2023-02-09T18:57:56Z) - Multimodal Few-Shot Learning with Frozen Language Models [36.75551859968596]
We train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption.
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples.
arXiv Detail & Related papers (2021-06-25T21:07:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.