Contrastive Cross-Modal Knowledge Sharing Pre-training for
Vision-Language Representation Learning and Retrieval
- URL: http://arxiv.org/abs/2207.00733v1
- Date: Sat, 2 Jul 2022 04:08:44 GMT
- Title: Contrastive Cross-Modal Knowledge Sharing Pre-training for
Vision-Language Representation Learning and Retrieval
- Authors: Keyu Wen, Zhenshan Tan, Qingrong Cheng, Cheng Chen, and Xiaodong Gu
- Abstract summary: A Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) is developed to grasp the joint text-image representations.
The first module is a weight-sharing transformer that builds on the head of the visual and textual encoders.
The other one is three specially designed contrastive learning, aiming to share knowledge between different models.
- Score: 12.30468719055037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the cross-modal pre-training task has been a hotspot because of its
wide application in various down-streaming researches including retrieval,
captioning, question answering and so on. However, exiting methods adopt a
one-stream pre-training model to explore the united vision-language
representation for conducting cross-modal retrieval, which easily suffer from
the calculation explosion. Moreover, although the conventional double-stream
structures are quite efficient, they still lack the vital cross-modal
interactions, resulting in low performances. Motivated by these challenges, we
put forward a Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE)
to grasp the joint text-image representations. Structurally, COOKIE adopts the
traditional double-stream structure because of the acceptable time consumption.
To overcome the inherent defects of double-stream structure as mentioned above,
we elaborately design two effective modules. Concretely, the first module is a
weight-sharing transformer that builds on the head of the visual and textual
encoders, aiming to semantically align text and image. This design enables
visual and textual paths focus on the same semantics. The other one is three
specially designed contrastive learning, aiming to share knowledge between
different models. The shared cross-modal knowledge develops the study of
unimodal representation greatly, promoting the single-modal retrieval tasks.
Extensive experimental results on multi-modal matching researches that includes
cross-modal retrieval, text matching, and image retrieval reveal the superiors
in calculation efficiency and statistical indicators of our pre-training model.
Related papers
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR)
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z) - FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z) - ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text
Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z) - Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-ofthe-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.