CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for
Image-Text Retrieval
- URL: http://arxiv.org/abs/2208.09843v1
- Date: Sun, 21 Aug 2022 08:37:50 GMT
- Title: CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for
Image-Text Retrieval
- Authors: Haoran Wang, Dongliang He, Wenhao Wu, Boyang Xia, Min Yang, Fu Li,
Yunlong Yu, Zhong Ji, Errui Ding, Jingdong Wang
- Abstract summary: We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
- Score: 108.48540976175457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-Text Retrieval (ITR) is challenging because it must bridge the visual and
linguistic modalities. Contrastive learning has been adopted by most prior arts. Beyond
the limited number of negative image-text pairs, the capability of contrastive
learning is restricted by manually weighted negative pairs as well as
unawareness of external knowledge. In this paper, we propose a novel Coupled
Diversity-Sensitive Momentum Contrastive Learning (CODER) method for improving
cross-modal representation. Firstly, a novel diversity-sensitive contrastive
learning (DCL) architecture is invented. We introduce dynamic dictionaries for
both modalities to enlarge the scale of image-text pairs, and
diversity-sensitiveness is achieved by adaptive negative pair weighting.
Furthermore, two branches are designed in CODER. One learns instance-level
embeddings from image/text, and it also generates pseudo online clustering
labels for its input image/text based on their embeddings. Meanwhile, the other
branch learns to query from commonsense knowledge graph to form concept-level
descriptors for both modalities. Afterwards, both branches leverage DCL to
align the cross-modal embedding spaces while an extra pseudo clustering label
prediction loss is utilized to promote concept-level representation learning
for the second branch. Extensive experiments conducted on two popular
benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms
the state-of-the-art approaches.
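For intuition, the core DCL objective can be sketched in a few lines of PyTorch. The snippet below is a minimal sketch under assumptions, not the authors' released implementation: it assumes MoCo-style momentum encoders for each modality, keeps a per-modality feature queue as the dynamic dictionary, approximates the adaptive negative pair weighting with a softmax over negative similarities, and adds an illustrative pseudo clustering-label prediction loss; all names and hyper-parameters (e.g. `beta`, `queue_size`) are assumptions.

```python
# Minimal sketch, assuming a MoCo-style setup; NOT the authors' released code.
# The queues stand in for the paper's dynamic dictionaries, and the "adaptive
# negative pair weighting" is approximated by softmax re-weighting of negative
# similarities. Names and hyper-parameters are illustrative assumptions.
import torch
import torch.nn.functional as F


class DiversitySensitiveContrast(torch.nn.Module):
    def __init__(self, dim=256, queue_size=8192, temperature=0.07, beta=10.0):
        super().__init__()
        self.temperature = temperature
        self.beta = beta  # assumed sharpness of the negative re-weighting
        # dynamic dictionaries: one queue of momentum features per modality
        self.register_buffer("img_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))

    def weighted_info_nce(self, anchors, positives, queue):
        """InfoNCE in which each in-queue negative is re-weighted by its hardness."""
        pos = (anchors * positives).sum(dim=1, keepdim=True) / self.temperature  # (B, 1)
        neg = anchors @ queue.t() / self.temperature                             # (B, K)
        # harder (more similar) negatives receive larger weights
        w = torch.softmax(self.beta * neg.detach(), dim=1) * neg.size(1)
        logits = torch.cat([pos, neg + torch.log(w + 1e-12)], dim=1)
        targets = torch.zeros(anchors.size(0), dtype=torch.long, device=anchors.device)
        return F.cross_entropy(logits, targets)

    @torch.no_grad()
    def _enqueue(self, img_m, txt_m):
        b = img_m.size(0)  # simple FIFO update of both dictionaries
        self.img_queue = torch.cat([img_m, self.img_queue[:-b]], dim=0)
        self.txt_queue = torch.cat([txt_m, self.txt_queue[:-b]], dim=0)

    def forward(self, img_emb, txt_emb, img_emb_m, txt_emb_m):
        """img_emb/txt_emb come from the online encoders; *_m from momentum encoders."""
        img_emb, txt_emb = F.normalize(img_emb, dim=1), F.normalize(txt_emb, dim=1)
        img_m, txt_m = F.normalize(img_emb_m, dim=1), F.normalize(txt_emb_m, dim=1)
        loss = 0.5 * (self.weighted_info_nce(img_emb, txt_m, self.txt_queue)
                      + self.weighted_info_nce(txt_emb, img_m, self.img_queue))
        self._enqueue(img_m.detach(), txt_m.detach())
        return loss


def pseudo_label_loss(logits, embeddings, centroids):
    """Illustrative stand-in for the pseudo clustering-label prediction loss:
    online pseudo labels are the nearest centroid of each (momentum) embedding."""
    with torch.no_grad():
        pseudo = (F.normalize(embeddings, dim=1) @ F.normalize(centroids, dim=1).t()).argmax(dim=1)
    return F.cross_entropy(logits, pseudo)
```

Read against the abstract, the same contrastive term would be applied by both the instance-level and the concept-level branch, with the pseudo-label prediction term attached to the concept-level branch; how the concept-level descriptors are queried from the commonsense knowledge graph is not sketched here.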
Related papers
- Dual-Level Cross-Modal Contrastive Clustering [4.083185193413678]
We propose a novel image clustering framework, named Dual-level Cross-Modal Contrastive Clustering (DXMC).
External textual information is introduced for constructing a semantic space, which is adopted to generate image-text pairs.
The image-text pairs are respectively sent to pre-trained image and text encoders to obtain image and text embeddings, which are subsequently fed into four well-designed networks.
arXiv Detail & Related papers (2024-09-06T18:49:45Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++)
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification [1.1470070927586016]
We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the discriminant features of the image and text modalities to be learned simultaneously throughout the training stage.
This is the first work to leverage a mutual learning approach along with a self-attention-based fusion module to perform document image classification.
arXiv Detail & Related papers (2023-05-11T16:05:03Z)
- Image-Text Retrieval with Binary and Continuous Label Supervision [38.682970905704906]
This paper proposes an image-text retrieval framework with Binary and Continuous Label Supervision (BCLS).
For the learning of binary labels, we enhance the common Triplet ranking loss with Soft Negative mining (Triplet-SN) to improve convergence.
For the learning of continuous labels, we design a Kendall ranking loss, inspired by the Kendall rank correlation coefficient, to improve the correlation between the similarity scores predicted by the retrieval model and the continuous labels.
arXiv Detail & Related papers (2022-10-20T14:52:34Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)