A Comprehensive Empirical Study of Vision-Language Pre-trained Model for
Supervised Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2201.02772v1
- Date: Sat, 8 Jan 2022 06:00:22 GMT
- Title: A Comprehensive Empirical Study of Vision-Language Pre-trained Model for
Supervised Cross-Modal Retrieval
- Authors: Zhixiong Zeng and Wenji Mao
- Abstract summary: Cross-Modal Retrieval (CMR) is an important research topic across multimodal computing and information retrieval.
We take CLIP as the current representative vision-language pre-trained model to conduct a comprehensive empirical study.
We propose a novel model, CLIP4CMR, that employs pre-trained CLIP as the backbone network to perform supervised CMR.
- Score: 19.2650103482509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-Modal Retrieval (CMR) is an important research topic across multimodal
computing and information retrieval, which takes one type of data as the query
to retrieve relevant data of another type, and has been widely used in many
real-world applications. Recently, vision-language pre-trained models,
represented by CLIP, have demonstrated their superiority in learning visual
and textual representations and impressive performance on various vision and
language related tasks. Although CLIP and previous pre-trained models have
shown great performance improvements in unsupervised CMR, their performance
and impact on supervised CMR have rarely been explored due to the lack of
multimodal class-level associations.
In this paper, we take CLIP as the current representative vision-language
pre-trained model to conduct a comprehensive empirical study and provide
insights into its performance and impact on supervised CMR. To this end, we
first propose a novel model CLIP4CMR (CLIP For supervised Cross-Modal
Retrieval) that employs pre-trained CLIP as the backbone network to perform
supervised CMR. We then revisit the existing loss function designs in CMR,
including the most common pair-wise losses, class-wise losses and hybrid ones,
and provide insights into applying CLIP. Moreover, we investigate several key
concerns in supervised CMR and provide new perspectives for this field via
CLIP4CMR, including the robustness to modality imbalance and the sensitivity
to hyper-parameters. Extensive experimental results show that CLIP4CMR
achieves state-of-the-art results with significant improvements on the
benchmark datasets Wikipedia, NUS-WIDE, Pascal-Sentence and XmediaNet. Our
data and code are publicly available at
https://github.com/zhixiongz/CLIP4CMR.
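As a concrete illustration of the setup described in the abstract, below is a minimal sketch of a CLIP4CMR-style model: pre-trained CLIP encoders feed modality-specific projection heads into a common space, trained with a class-wise (cross-entropy) loss plus a simple pair-wise term. The layer sizes, module names and loss weighting here are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch of a CLIP4CMR-style model (illustrative only; see the official
# repository above for the authors' implementation). Pre-trained CLIP supplies
# image/text features; projection heads map them into a shared space; a shared
# classifier gives the class-wise loss and a pair-wise term pulls matched
# image-text pairs together. All sizes and weights below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP


class CLIP4CMRSketch(nn.Module):
    def __init__(self, num_classes: int, shared_dim: int = 256, device: str = "cpu"):
        super().__init__()
        self.backbone, _ = clip.load("ViT-B/32", device=device)
        feat_dim = self.backbone.visual.output_dim       # 512 for ViT-B/32
        self.img_proj = nn.Linear(feat_dim, shared_dim)   # image projection head
        self.txt_proj = nn.Linear(feat_dim, shared_dim)   # text projection head
        self.classifier = nn.Linear(shared_dim, num_classes)  # shared classifier

    def forward(self, images, token_ids):
        img = self.img_proj(self.backbone.encode_image(images).float())
        txt = self.txt_proj(self.backbone.encode_text(token_ids).float())
        return img, txt


def supervised_cmr_loss(model, img_emb, txt_emb, labels, alpha=1.0, beta=0.1):
    # Class-wise term: both modalities must predict the shared class label.
    class_loss = (F.cross_entropy(model.classifier(img_emb), labels)
                  + F.cross_entropy(model.classifier(txt_emb), labels))
    # Pair-wise term: matched image-text pairs should lie close in the space.
    pair_loss = F.mse_loss(F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1))
    return alpha * class_loss + beta * pair_loss
```

At retrieval time, image and text embeddings would be L2-normalized and ranked by cosine similarity across modalities.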
Related papers
- Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric that indicates the multi-modal pre-training quality of Large Vision-Language Models (LVLMs).
MIR is informative for training data selection, training strategy scheduling, and model architecture design toward better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Theoretical Insights into Overparameterized Models in Multi-Task and Replay-Based Continual Learning [37.745896674964186]
Multi-task learning (MTL) aims to improve the generalization performance of a model on multiple related tasks by training it simultaneously on those tasks.
Continual learning (CL) involves adapting to new sequentially arriving tasks over time without forgetting the previously acquired knowledge.
We develop theoretical results describing the effect of various system parameters on the model's performance in an MTL setup.
Our results reveal the impact of buffer size and model capacity on the forgetting rate in a CL setup and help shed light on some of the state-of-the-art CL methods.
arXiv Detail & Related papers (2024-08-29T23:22:40Z)
- URRL-IMVC: Unified and Robust Representation Learning for Incomplete Multi-View Clustering [28.776476995363048]
We propose a novel Unified and Robust Representation Learning framework for Incomplete Multi-View Clustering (URRL-IMVC).
URRL-IMVC directly learns a unified embedding that is robust to view missing conditions by integrating information from multiple views and neighboring samples.
We extensively evaluate the proposed URRL-IMVC framework on various benchmark datasets, demonstrating its state-of-the-art performance.
arXiv Detail & Related papers (2024-07-12T09:35:25Z)
- Theory on Mixture-of-Experts in Continual Learning [72.42497633220547]
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time.
Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks.
The MoE model has recently been shown to effectively mitigate catastrophic forgetting in CL by employing a gating network (a generic sketch of such a gating layer appears after this list).
arXiv Detail & Related papers (2024-06-24T08:29:58Z)
- What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights [67.72413262980272]
Severe data imbalance naturally exists among web-scale vision-language datasets.
We find that CLIP pre-trained on such data exhibits notable robustness to the data imbalance compared to supervised learning.
The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
arXiv Detail & Related papers (2024-05-31T17:57:24Z)
- Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle the problem using instance-discrimination-based contrastive learning (CL).
In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models (a generic sketch of the contrastive objective appears after this list).
arXiv Detail & Related papers (2022-09-28T04:36:34Z)
- Interventional Contrastive Learning with Meta Semantic Regularizer [28.708395209321846]
Contrastive learning (CL)-based self-supervised learning models learn visual representations in a pairwise manner.
When the CL model is trained on full images, performance tested on full images is better than on foreground areas.
When the CL model is trained on foreground areas, performance tested on full images is worse than on foreground areas.
arXiv Detail & Related papers (2022-06-29T15:02:38Z)
- Competence-based Multimodal Curriculum Learning for Medical Report Generation [98.10763792453925]
We propose a Competence-based Multimodal Curriculum Learning framework (CMCL) to alleviate the data bias and make the best use of available data.
Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.
arXiv Detail & Related papers (2022-06-24T08:16:01Z)
- On Continual Model Refinement in Out-of-Distribution Data Streams [64.62569873799096]
Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams.
Existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario.
We propose a new CL problem formulation dubbed continual model refinement (CMR).
arXiv Detail & Related papers (2022-05-04T11:54:44Z)
- The CLEAR Benchmark: Continual LEArning on Real-World Imagery [77.98377088698984]
Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI.
We introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts.
We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms.
arXiv Detail & Related papers (2022-01-17T09:09:09Z)
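As referenced from the "Theory on Mixture-of-Experts in Continual Learning" entry above, the mechanism in question is a gating network that weights a set of experts per input. The following is a generic sketch of such a mixture-of-experts layer, not the specific architecture analyzed in that paper.

```python
# Generic mixture-of-experts layer with a softmax gating network: the gate
# produces per-input weights over experts and the outputs are mixed accordingly.
# This sketches the general mechanism only, not the cited paper's model.
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # gating network

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                # (batch, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # gated mixture
```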
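Likewise, the "Learning Deep Representations via Contrastive Learning for Instance Retrieval" entry mentions instance-discrimination contrastive learning, which typically optimizes an InfoNCE objective over augmented views of the same image. Below is a generic sketch of that objective, not the cited paper's training recipe.

```python
# Generic InfoNCE loss for instance discrimination: two augmented views of the
# same image are positives; all other samples in the batch act as negatives.
# Sketch of the general technique, not the cited paper's exact recipe.
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```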
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.