Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
- URL: http://arxiv.org/abs/2206.07643v1
- Date: Wed, 15 Jun 2022 16:41:29 GMT
- Title: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
- Authors: Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang,
Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan
Wang
- Abstract summary: We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
- Score: 170.85076677740292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language (VL) pre-training has recently received considerable
attention. However, most existing end-to-end pre-training approaches either
only aim to tackle VL tasks such as image-text retrieval, visual question
answering (VQA) and image captioning that test high-level understanding of
images, or only target region-level understanding for tasks such as phrase
grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based
transformER), a new VL model architecture that can seamlessly handle both these
types of tasks. Instead of having dedicated transformer layers for fusion after
the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by
inserting cross-attention into the image and text backbones, bringing gains in
terms of memory and performance. In addition, unlike previous work that is
either only pre-trained on image-text data or on fine-grained data with
box-level annotations, we present a two-stage pre-training strategy that uses
both these kinds of data efficiently: (i) coarse-grained pre-training based on
image-text data; followed by (ii) fine-grained pre-training based on
image-text-box data. We conduct comprehensive experiments on a wide range of VL
tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding,
referring expression comprehension, and object detection. Using deep multimodal
fusion coupled with the two-stage pre-training, FIBER provides consistent
performance improvements over strong baselines across all tasks, often
outperforming methods using magnitudes more data. Code is available at
https://github.com/microsoft/FIBER.
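As a rough illustration of the "fusion in the backbone" idea, the sketch below inserts a gated cross-attention step into an otherwise standard transformer block, so a backbone layer can attend to features from the other modality. This is a minimal PyTorch sketch under assumed design choices (block structure, module names, a zero-initialized scalar gate), not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn
from typing import Optional


class FusionBlock(nn.Module):
    """Minimal sketch of a transformer block with fusion in the backbone.

    After the usual self-attention, the block cross-attends to features
    from the other modality, scaled by a learnable gate. The gate starts
    at zero, so the layer initially behaves like the original uni-modal
    block, and fusion is skipped entirely when only one modality is
    present. Illustrative simplification, not FIBER's code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm_self = nn.LayerNorm(dim)
        self.norm_cross = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # fusion strength, learned

    def forward(self, x: torch.Tensor, other: Optional[torch.Tensor] = None) -> torch.Tensor:
        h = self.norm_self(x)
        x = x + self.self_attn(h, h, h)[0]
        if other is not None:  # cross-attend to the other modality's features
            h = self.norm_cross(x)
            x = x + self.gate * self.cross_attn(h, other, other)[0]
        return x + self.ffn(self.norm_ffn(x))


# Usage: one fusion block per modality; each attends to the other's features.
if __name__ == "__main__":
    img_block, txt_block = FusionBlock(768), FusionBlock(768)
    img = torch.randn(2, 49, 768)   # image patch features
    txt = torch.randn(2, 16, 768)   # text token features
    img_fused = img_block(img, other=txt)
    txt_fused = txt_block(txt, other=img)
    print(img_fused.shape, txt_fused.shape)  # (2, 49, 768) and (2, 16, 768)
```

One design point the sketch tries to capture: with the gate at zero, each layer reduces to its original uni-modal behaviour, so the fused backbones can still be run without the other modality when a task requires it.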
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.