FILIP: Fine-grained Interactive Language-Image Pre-Training
- URL: http://arxiv.org/abs/2111.07783v1
- Date: Tue, 9 Nov 2021 17:15:38 GMT
- Title: FILIP: Fine-grained Interactive Language-Image Pre-Training
- Authors: Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu,
Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu
- Abstract summary: Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
- Score: 106.19474076935363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised large-scale vision-language pre-training has shown promising
advances on various downstream tasks. Existing methods often model the
cross-modal interaction either via the similarity of the global features of each
modality, which lacks sufficient fine-grained information, or via finer-grained
interactions using cross/self-attention over visual and textual tokens. However,
cross/self-attention suffers from inferior efficiency in both training and
inference. In this paper, we introduce a large-scale Fine-grained Interactive
Language-Image Pre-training (FILIP) to achieve finer-level alignment through a
cross-modal late interaction mechanism, which uses a token-wise maximum
similarity between visual and textual tokens to guide the contrastive
objective. FILIP successfully leverages the finer-grained expressiveness
between image patches and textual words by modifying only the contrastive loss,
while simultaneously gaining the ability to pre-compute image and text
representations offline at inference, keeping both large-scale training and
inference efficient. Furthermore, we construct a new large-scale image-text
pair dataset called FILIP300M for pre-training. Experiments show that FILIP
achieves state-of-the-art performance on multiple downstream vision-language
tasks including zero-shot image classification and image-text retrieval. The
visualization on word-patch alignment further shows that FILIP can learn
meaningful fine-grained features with promising localization ability.
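
To make the late interaction mechanism concrete, below is a minimal sketch of the token-wise maximum similarity described in the abstract; the array shapes, function names, and NumPy setting are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of FILIP's cross-modal late interaction (token-wise maximum
# similarity). Shapes and names are illustrative, not the authors' code.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize token embeddings so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def late_interaction_similarity(image_tokens, text_tokens):
    """Similarity scores between one image and one text.

    image_tokens: (n_patches, d) patch embeddings from the visual encoder
    text_tokens:  (n_words, d)   word embeddings from the text encoder
    Returns (image-to-text score, text-to-image score).
    """
    img = l2_normalize(image_tokens)
    txt = l2_normalize(text_tokens)
    sim = img @ txt.T                 # (n_patches, n_words) token-wise similarities
    i2t = sim.max(axis=1).mean()      # each patch keeps its most similar word, then average
    t2i = sim.max(axis=0).mean()      # each word keeps its most similar patch, then average
    return i2t, t2i

# These per-pair scores replace the single global-feature dot product inside
# the usual contrastive (InfoNCE) objective computed over a batch of pairs.
rng = np.random.default_rng(0)
i2t, t2i = late_interaction_similarity(rng.normal(size=(49, 256)),
                                        rng.normal(size=(12, 256)))
print(i2t, t2i)
```

Since the two modalities interact only through this cheap max/mean reduction over pre-computed token embeddings, image and text representations can be encoded offline and reused at inference, which is the efficiency benefit highlighted above.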
Related papers
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification [1.1470070927586016]
We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the network to simultaneously learn the discriminant features of the image and text modalities throughout the training stage.
This is the first work to leverage a mutual learning approach along with a self-attention-based fusion module for document image classification.
arXiv Detail & Related papers (2023-05-11T16:05:03Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUagE PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pre-training model, termed COTS, for image-text retrieval.
COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800x faster at inference.
Importantly, COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.