RankCLIP: Ranking-Consistent Language-Image Pretraining
- URL: http://arxiv.org/abs/2404.09387v2
- Date: Thu, 20 Jun 2024 16:20:37 GMT
- Title: RankCLIP: Ranking-Consistent Language-Image Pretraining
- Authors: Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun,
- Abstract summary: RANKCLIP is a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP.
By extending the traditional pair-wise loss to list-wise, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality.
- Score: 7.92247304974314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.
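The step from a pair-wise to a list-wise objective described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the list-wise term is a generic Plackett-Luce (ListMLE-style) likelihood, and the function names, the temperature value, and the way the cross-modal and in-modal reference rankings are chosen are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def clip_pairwise_loss(img, txt, temperature=0.07):
    """CLIP-style symmetric contrastive loss: only matched pairs (the diagonal) are positives."""
    logits = img @ txt.T / temperature
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def listmle(scores, reference):
    """Plackett-Luce negative log-likelihood of the ranking induced by `reference`,
    evaluated under `scores` and averaged over rows (lower means more consistent)."""
    order = np.argsort(-reference, axis=1)           # best-to-worst under the reference
    s = np.take_along_axis(scores, order, axis=1)    # scores rearranged into that order
    m = s.max(axis=1, keepdims=True)
    # log of sum_{j >= k} exp(s_j) for every position k, via a reversed cumulative sum
    suffix = np.cumsum(np.exp(s - m)[:, ::-1], axis=1)[:, ::-1]
    return (np.log(suffix) + m - s).sum(axis=1).mean()


def ranking_consistency_losses(img, txt, temperature=0.07):
    """Assumed pairing of rankings (illustrative only):
    cross-modal: image-to-text scores should rank like text-to-image scores;
    in-modal:    text-text scores should rank like image-image scores."""
    cross = img @ txt.T / temperature
    in_img = img @ img.T / temperature
    in_txt = txt @ txt.T / temperature
    return listmle(cross, cross.T), listmle(in_txt, in_img)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = l2_normalize(rng.normal(size=(4, 8)))  # stand-ins for image embeddings
    txt = l2_normalize(rng.normal(size=(4, 8)))  # stand-ins for text embeddings
    cross_modal, in_modal = ranking_consistency_losses(img, txt)
    weight = 0.5  # hypothetical weighting, not taken from the paper
    total = clip_pairwise_loss(img, txt) + weight * (cross_modal + in_modal)
    print(float(total))
```

The pair-wise term treats every off-diagonal pair as an equally wrong negative, whereas the list-wise terms penalize disagreement over the full ordering of similarities, which is one way to express graded, many-to-many relationships.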
Related papers
- CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training [17.27516384073838]
We propose CMAL, a Cross-Modal Associative Learning framework with anchor points detection and cross-modal associative learning.
CMAL achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks.
arXiv Detail & Related papers (2024-10-16T14:12:26Z)
- Constrained Multi-Layer Contrastive Learning for Implicit Discourse Relationship Recognition [9.644808110078833]
We adapt a supervised contrastive learning (CL) method, label- and instance-centered CL, to enhance representation learning.
Experimental results on PDTB 2.0 and PDTB 3.0 show that our approach can significantly improve the performance on both multi-class classification and binary classification.
arXiv Detail & Related papers (2024-09-07T17:55:41Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods, and performance comparable to the latest single-stream methods while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction [24.853265244512954]
We propose a hierarchical contrastive learning framework for distantly supervised relation extraction (HiCLRE) to reduce noisy sentences.
Specifically, we propose a three-level hierarchical learning framework to interact with cross levels, generating the de-noising context-aware representations.
Experiments demonstrate that HiCLRE significantly outperforms strong baselines in various mainstream DSRE datasets.
arXiv Detail & Related papers (2022-02-27T12:48:26Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Contrastive Learning with Adversarial Examples [79.39156814887133]
Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations.
This paper introduces a new family of adversarial examples for contrastive learning and uses these examples to define a new adversarial training algorithm for SSL, denoted as CLAE.
arXiv Detail & Related papers (2020-10-22T20:45:10Z)