BLIP: Bootstrapping Language-Image Pre-training for Unified
Vision-Language Understanding and Generation
- URL: http://arxiv.org/abs/2201.12086v1
- Date: Fri, 28 Jan 2022 12:49:48 GMT
- Title: BLIP: Bootstrapping Language-Image Pre-training for Unified
Vision-Language Understanding and Generation
- Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
- Abstract summary: We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
- Score: 86.4572981982407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Pre-training (VLP) has advanced the performance for many
vision-language tasks. However, most existing pre-trained models only excel in
either understanding-based tasks or generation-based tasks. Furthermore,
performance improvement has been largely achieved by scaling up the dataset
with noisy image-text pairs collected from the web, which is a suboptimal
source of supervision. In this paper, we propose BLIP, a new VLP framework
which transfers flexibly to both vision-language understanding and generation
tasks. BLIP effectively utilizes the noisy web data by bootstrapping the
captions, where a captioner generates synthetic captions and a filter removes
the noisy ones. We achieve state-of-the-art results on a wide range of
vision-language tasks, such as image-text retrieval (+2.7% in average
recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score).
BLIP also demonstrates strong generalization ability when directly transferred
to video-language tasks in a zero-shot manner. Code, models, and datasets are
released at https://github.com/salesforce/BLIP.
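For illustration, the caption-bootstrapping idea (CapFilt) described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the released BLIP code: the `captioner`, `filter_model`, and the 0.5 match threshold are hypothetical stand-ins for the captioner and filter modules the paper describes.

```python
# Minimal sketch of BLIP-style caption bootstrapping (CapFilt).
# NOT the released implementation: `captioner`, `filter_model`, and the
# 0.5 threshold are placeholder assumptions for illustration only.

def bootstrap_captions(web_pairs, captioner, filter_model, threshold=0.5):
    """Return a cleaned dataset from noisy (image, web_caption) pairs.

    captioner(image)          -> synthetic caption string
    filter_model(image, text) -> image-text match score in [0, 1]
    """
    cleaned = []
    for image, web_caption in web_pairs:
        # Keep the original web caption only if the filter judges it matched.
        if filter_model(image, web_caption) >= threshold:
            cleaned.append((image, web_caption))

        # Generate a synthetic caption and keep it if it passes the filter.
        synthetic = captioner(image)
        if filter_model(image, synthetic) >= threshold:
            cleaned.append((image, synthetic))
    return cleaned
```

In the paper's setup, the cleaned pairs (together with human-annotated data) are then used to pre-train a new model, which is the bootstrapping step the abstract refers to.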
Related papers
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning [22.93684323791136]
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.
We introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task data.
Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based tasks through ICCC instruction tuning.
arXiv Detail & Related papers (2024-04-01T04:28:01Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
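As a rough illustration of the dual-encoder contrastive scheme described in the last entry above, the sketch below computes a symmetric image-text contrastive (InfoNCE-style) loss in PyTorch. The function name, the normalization step, and the temperature value are illustrative assumptions, not code from that paper.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of two separate encoders.
    The temperature value is an illustrative choice, not taken from the paper.
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

With a large enough batch and corpus, pulling matched pairs together and pushing mismatched pairs apart in this way is the "simple learning scheme" that the entry credits with overcoming the noise in web alt-text data.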
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.