COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2204.07441v1
- Date: Fri, 15 Apr 2022 12:34:47 GMT
- Title: COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval
- Authors: Haoyu Lu and Nanyi Fei and Yuqi Huo and Yizhao Gao and Zhiwu Lu and
Ji-Rong Wen
- Abstract summary: We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and performance comparable to the latest single-stream methods while being 10,800x faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely-used MSR-VTT dataset.
- Score: 59.15034487974549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale single-stream pre-training has shown dramatic performance in
image-text retrieval. Regrettably, it faces low inference efficiency due to
heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with
high inference efficiency have also shown promising performance; however, they
only consider instance-level alignment between the two streams (thus there is
still room for improvement). To overcome these limitations, we propose a novel
COllaborative Two-Stream vision-language pretraining model termed COTS for
image-text retrieval by enhancing cross-modal interaction. In addition to
instance level alignment via momentum contrastive learning, we leverage two
extra levels of cross-modal interactions in our COTS: (1) Token-level
interaction - a masked vision-language modeling (MVLM) learning objective is
devised without using a cross-stream network module, where a variational
autoencoder is imposed on the visual encoder to generate visual tokens for each
image. (2) Task-level interaction - a KL-alignment learning objective is
devised between text-to-image and image-to-text retrieval tasks, where the
probability distribution per task is computed with the negative queues in
momentum contrastive learning. Under a fair comparison setting, our COTS
achieves the highest performance among all two-stream methods and comparable
performance (but with 10,800x faster inference) w.r.t. the latest
single-stream methods. Importantly, our COTS is also applicable to
text-to-video retrieval, yielding new state-of-the-art results on the widely-used
MSR-VTT dataset.
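The following is a minimal PyTorch-style sketch of how the two queue-based objectives described above (instance-level InfoNCE under momentum contrastive learning, and the task-level KL-alignment between the text-to-image and image-to-text distributions) could be written. All function names, tensor shapes, and hyperparameters such as the temperature and queue size are illustrative assumptions, not the authors' released implementation.

    # Illustrative sketch of the instance-level and task-level objectives.
    # Names, shapes, and hyperparameters are assumptions, not the paper's code.
    import torch
    import torch.nn.functional as F

    def cots_style_losses(img_emb, txt_emb, img_queue, txt_queue, temperature=0.07):
        """img_emb, txt_emb: (B, D) L2-normalized embeddings from the two streams.
        img_queue, txt_queue: (K, D) L2-normalized embeddings from the momentum
        encoders, acting as the negative queues."""
        B = img_emb.size(0)

        # Image-to-text retrieval: score each image against its paired text
        # (positive) and against the K queued texts (negatives).
        pos_i2t = (img_emb * txt_emb).sum(dim=-1, keepdim=True)          # (B, 1)
        neg_i2t = img_emb @ txt_queue.t()                                 # (B, K)
        logits_i2t = torch.cat([pos_i2t, neg_i2t], dim=1) / temperature   # (B, 1+K)

        # Text-to-image retrieval: symmetric construction with the image queue.
        pos_t2i = (txt_emb * img_emb).sum(dim=-1, keepdim=True)
        neg_t2i = txt_emb @ img_queue.t()
        logits_t2i = torch.cat([pos_t2i, neg_t2i], dim=1) / temperature

        # Instance-level alignment: InfoNCE in both directions (index 0 is the positive).
        target = torch.zeros(B, dtype=torch.long, device=img_emb.device)
        loss_nce = F.cross_entropy(logits_i2t, target) + F.cross_entropy(logits_t2i, target)

        # Task-level interaction: the probability distribution per retrieval task,
        # computed over the positive plus the queued negatives, is aligned with a
        # symmetric KL term between the two tasks.
        p_i2t = F.softmax(logits_i2t, dim=1)
        p_t2i = F.softmax(logits_t2i, dim=1)
        loss_kl = 0.5 * (
            F.kl_div(p_i2t.log(), p_t2i, reduction="batchmean")
            + F.kl_div(p_t2i.log(), p_i2t, reduction="batchmean")
        )
        return loss_nce, loss_kl

In the actual model, the queued embeddings come from momentum-updated encoders and these terms are combined with the token-level MVLM loss; the exact loss weighting is not reproduced here.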
Related papers
- LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module that injects unmasked image embeddings into masked text embeddings is proposed.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval [12.30468719055037]
A Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) is developed to grasp the joint text-image representations.
The first module is a weight-sharing transformer built on top of the visual and textual encoders.
The other consists of three specially designed contrastive learning objectives, aiming to share knowledge between different models.
arXiv Detail & Related papers (2022-07-02T04:08:44Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism (an illustrative sketch of this token-wise scoring appears after this list).
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels [35.57369098866317]
Vision-language pre-training on large-scale image-text pairs has witnessed rapid progress for learning cross-modal representations.
We propose a new pre-training method which jointly aligns both the low-level and high-level semantics between image and text representations.
arXiv Detail & Related papers (2021-03-14T02:39:14Z)
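For the FILIP entry above, "cross-modal late interaction" scores an image-text pair at the token level rather than with a single global embedding per modality: each image patch is matched to its most similar text token and vice versa, and the per-token maxima are averaged. Below is a minimal illustrative sketch of that idea; the function name, the L2-normalization assumption, and the averaging of the two directions into one score are simplifications, not FILIP's exact formulation.

    # Illustrative sketch of token-wise late interaction (not FILIP's released code).
    import torch

    def late_interaction_similarity(img_tokens, txt_tokens):
        """img_tokens: (N_i, D) patch embeddings, txt_tokens: (N_t, D) word embeddings,
        both assumed L2-normalized."""
        sim = img_tokens @ txt_tokens.t()        # (N_i, N_t) token-pair cosine similarities
        i2t = sim.max(dim=1).values.mean()       # each patch matched to its best text token
        t2i = sim.max(dim=0).values.mean()       # each text token matched to its best patch
        return 0.5 * (i2t + t2i)                 # single scalar similarity for the pair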
This list is automatically generated from the titles and abstracts of the papers on this site.