MAGMA -- Multimodal Augmentation of Generative Models through
Adapter-based Finetuning
- URL: http://arxiv.org/abs/2112.05253v1
- Date: Thu, 9 Dec 2021 23:58:45 GMT
- Title: MAGMA -- Multimodal Augmentation of Generative Models through
Adapter-based Finetuning
- Authors: Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia
Parcalabescu, Anette Frank
- Abstract summary: MAGMA is a simple method for augmenting generative language models with additional modalities using adapter-based finetuning.
We train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input.
- Score: 11.339580074756189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pretraining is fast becoming the norm in Vision-Language (VL)
modeling. However, prevailing VL approaches are limited by the requirement for
labeled data and the use of complex multi-step pretraining objectives. We
present MAGMA - a simple method for augmenting generative language models with
additional modalities using adapter-based finetuning. Building on Frozen, we
train a series of VL models that autoregressively generate text from arbitrary
combinations of visual and textual input. The pretraining is entirely
end-to-end using a single language modeling objective, simplifying optimization
compared to previous approaches. Importantly, the language model weights remain
unchanged during training, allowing for transfer of encyclopedic knowledge and
in-context learning abilities from language pretraining. MAGMA outperforms
Frozen on open-ended generative tasks, achieving state-of-the-art results on
the OKVQA benchmark and competitive results on a range of other popular VL
benchmarks, while pretraining on 0.2% of the number of samples used to train
SimVLM.
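
To make the recipe concrete, here is a minimal PyTorch sketch of the idea in the abstract, under assumed names and sizes: a frozen autoregressive language model receives a trainable image-derived prefix, small bottleneck adapters are the only new weights inside the frozen blocks, and training uses a single next-token language-modeling loss over the text positions. The classes (BottleneckAdapter, FrozenBlockWithAdapter, MagmaStyleModel), the toy dimensions, and the freshly initialized decoder stand-in are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    # Small trainable bottleneck with a residual connection; in this sketch it
    # is the only new weight inside each otherwise frozen transformer block.
    def __init__(self, dim, downsample=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // downsample)
        self.up = nn.Linear(dim // downsample, dim)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))


class FrozenBlockWithAdapter(nn.Module):
    # Freezes a pretrained transformer layer and appends an adapter after it.
    def __init__(self, frozen_layer, dim):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False  # language model weights stay unchanged
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x, mask):
        return self.adapter(self.layer(x, src_mask=mask))


class MagmaStyleModel(nn.Module):
    # Autoregressive LM over [image prefix tokens; text tokens].
    def __init__(self, vocab_size=1000, dim=256, n_layers=4,
                 n_img_tokens=4, img_feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Trainable projection mapping image-encoder features into the LM
        # embedding space as a short sequence of prefix tokens.
        self.img_prefix = nn.Linear(img_feat_dim, n_img_tokens * dim)
        self.n_img_tokens, self.dim = n_img_tokens, dim
        # Stand-in for a pretrained decoder stack; real use would wrap the
        # blocks of a pretrained GPT-style model instead of fresh layers.
        layers = [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                  for _ in range(n_layers)]
        self.blocks = nn.ModuleList(FrozenBlockWithAdapter(l, dim) for l in layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, img_feats, text_ids):
        prefix = self.img_prefix(img_feats).view(-1, self.n_img_tokens, self.dim)
        h = torch.cat([prefix, self.embed(text_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        for block in self.blocks:
            h = block(h, mask)
        return self.lm_head(h)

    def loss(self, img_feats, text_ids):
        # Single language-modeling objective: next-token prediction on the
        # text positions, conditioned on the image prefix.
        logits = self.forward(img_feats, text_ids)
        text_logits = logits[:, self.n_img_tokens:-1, :]
        targets = text_ids[:, 1:]
        return F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                               targets.reshape(-1))


model = MagmaStyleModel()
img = torch.randn(2, 512)              # pooled features from an image encoder
txt = torch.randint(0, 1000, (2, 12))  # toy caption token ids
print(model.loss(img, txt).item())     # frozen blocks get no gradient updates

In actual use one would wrap a pretrained decoder (and freeze its embeddings and output head as well) together with a pretrained visual encoder; the sketch keeps the frozen blocks untouched while the adapters and the image-prefix projection carry the new multimodal signal.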
Related papers
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language
Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the recent MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources (see the sketch after this entry).
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
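
As a rough illustration of the queue-based dictionary mentioned in the WenLan/BriVL summary above, the following toy PyTorch sketch implements a MoCo-style cross-modal contrastive loss: momentum copies of two stand-in towers produce keys, and a fixed-size queue of past keys supplies extra negatives for an InfoNCE loss in both the image-to-text and text-to-image directions. The encoder stand-ins, queue length, momentum, and temperature are placeholder assumptions, not the BriVL configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMoCo(nn.Module):
    def __init__(self, dim=128, queue_size=4096, momentum=0.99, temperature=0.07):
        super().__init__()
        self.img_enc = nn.Linear(512, dim)  # trainable image tower (stand-in)
        self.txt_enc = nn.Linear(768, dim)  # trainable text tower (stand-in)
        # Momentum copies produce the keys that fill the negative queues.
        self.img_m, self.txt_m = nn.Linear(512, dim), nn.Linear(768, dim)
        self.img_m.load_state_dict(self.img_enc.state_dict())
        self.txt_m.load_state_dict(self.txt_enc.state_dict())
        for p in list(self.img_m.parameters()) + list(self.txt_m.parameters()):
            p.requires_grad = False
        self.m, self.t = momentum, temperature
        # Queues start with random unit vectors; real negatives replace them.
        self.register_buffer("img_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for enc, enc_m in ((self.img_enc, self.img_m), (self.txt_enc, self.txt_m)):
            for p, p_m in zip(enc.parameters(), enc_m.parameters()):
                p_m.mul_(self.m).add_(p.detach(), alpha=1 - self.m)

    @torch.no_grad()
    def _enqueue(self, queue, keys):
        ptr = int(self.ptr)
        queue[ptr:ptr + keys.size(0)] = keys  # assumes queue_size % batch == 0

    def _info_nce(self, q, k, neg_queue):
        l_pos = (q * k).sum(dim=1, keepdim=True)  # matching cross-modal pair
        l_neg = q @ neg_queue.t()                 # queued negatives
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        return F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long))

    def forward(self, img_feats, txt_feats):
        self._momentum_update()
        q_img = F.normalize(self.img_enc(img_feats), dim=1)
        q_txt = F.normalize(self.txt_enc(txt_feats), dim=1)
        with torch.no_grad():
            k_img = F.normalize(self.img_m(img_feats), dim=1)
            k_txt = F.normalize(self.txt_m(txt_feats), dim=1)
        # Symmetric InfoNCE: image queries vs text keys and vice versa.
        loss = self._info_nce(q_img, k_txt, self.txt_queue) + \
               self._info_nce(q_txt, k_img, self.img_queue)
        self._enqueue(self.img_queue, k_img)
        self._enqueue(self.txt_queue, k_txt)
        self.ptr[0] = (int(self.ptr) + img_feats.size(0)) % self.img_queue.size(0)
        return loss


# Toy usage with random pooled features.
loss = CrossModalMoCo()(torch.randn(8, 512), torch.randn(8, 768))
print(loss.item())

Because the queue retains keys from earlier batches, the number of negatives is decoupled from the batch size, which is the stated motivation for using it under limited GPU memory.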