Image-Text Retrieval with Binary and Continuous Label Supervision
- URL: http://arxiv.org/abs/2210.11319v1
- Date: Thu, 20 Oct 2022 14:52:34 GMT
- Title: Image-Text Retrieval with Binary and Continuous Label Supervision
- Authors: Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Ying Jin, Yufeng
Zhang
- Abstract summary: This paper proposes an image-text retrieval framework with Binary and Continuous Label Supervision (BCLS)
For the learning of binary labels, we enhance the common Triplet ranking loss with Soft Negative mining (Triplet-SN) to improve convergence.
For the learning of continuous labels, we design a Kendall ranking loss, inspired by the Kendall rank correlation coefficient, to improve the correlation between the similarity scores predicted by the retrieval model and the continuous labels.
- Score: 38.682970905704906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most image-text retrieval work adopts binary labels indicating whether a pair
of image and text matches or not. Such a binary indicator covers only a limited
subset of image-text semantic relations, which is insufficient to represent
relevance degrees between images and texts described by continuous labels such
as image captions. The visual-semantic embedding space obtained by learning
binary labels is incoherent and cannot fully characterize the relevance
degrees. In addition to the use of binary labels, this paper further
incorporates continuous pseudo labels (generally approximated by text
similarity between captions) to indicate the relevance degrees. To learn a
coherent embedding space, we propose an image-text retrieval framework with
Binary and Continuous Label Supervision (BCLS), where binary labels are used to
guide the retrieval model to learn limited binary correlations, and continuous
labels are complementary to the learning of image-text semantic relations. For
the learning of binary labels, we enhance the common Triplet ranking loss with
Soft Negative mining (Triplet-SN) to improve convergence. For the learning of
continuous labels, we design a Kendall ranking loss, inspired by the Kendall
rank correlation coefficient, which improves the correlation between the
similarity scores predicted by the retrieval model and the continuous labels.
To mitigate the noise introduced by the continuous pseudo labels, we further
design a Sliding Window sampling and Hard Sample mining strategy (SW-HS), which
alleviates the impact of noise and reduces the complexity of our framework to the
same order of magnitude as the triplet ranking loss. Extensive experiments on
two image-text retrieval benchmarks demonstrate that our method can improve the
performance of state-of-the-art image-text retrieval models.
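
The two ranking losses described in the abstract can be illustrated with a short sketch. The snippet below is a hypothetical PyTorch rendering, not the authors' implementation: the triplet loss follows the common hardest-negative formulation the paper builds on (the abstract does not specify the exact soft-negative selection rule of Triplet-SN), and `kendall_ranking_loss` is just one plausible differentiable surrogate for a Kendall-style concordance objective. All function names, the margin, and the temperature are illustrative assumptions.

```python
# Hypothetical sketch only; the paper's exact loss definitions are not given
# in the abstract, so the forms below are assumptions.
import torch
import torch.nn.functional as F


def triplet_loss_hard_negative(sim, margin=0.2):
    """Common triplet ranking loss with hardest-negative mining.

    sim: (B, B) image-to-text similarity matrix whose diagonal entries are
    the positive (matching) pairs.
    """
    pos = sim.diag().view(-1, 1)                                   # (B, 1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Image-to-text: hardest negative caption per image.
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    # Text-to-image: hardest negative image per caption.
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()


def kendall_ranking_loss(sim, pseudo_labels, temperature=0.05):
    """Surrogate for Kendall rank correlation between predicted similarities
    and continuous pseudo labels (e.g., caption-to-caption text similarity).

    For every candidate pair (j, k) of the same query q, if the pseudo label
    ranks j above k, the predicted score for j should exceed the score for k.
    A sigmoid on the score difference gives a differentiable discordance term.
    """
    # Pairwise differences over candidates: shape (B, B, B).
    score_diff = sim.unsqueeze(2) - sim.unsqueeze(1)
    label_diff = pseudo_labels.unsqueeze(2) - pseudo_labels.unsqueeze(1)
    # Penalize discordant pairs: label says j > k but the score disagrees.
    discordant = torch.sigmoid(-score_diff / temperature)
    weight = (label_diff > 0).float()
    return (weight * discordant).sum() / weight.sum().clamp(min=1)


# Toy usage with random embeddings; the pseudo labels stand in for
# caption-to-caption text similarity.
img = F.normalize(torch.randn(8, 256), dim=1)
txt = F.normalize(torch.randn(8, 256), dim=1)
sim = img @ txt.t()
pseudo = torch.rand(8, 8)
loss = triplet_loss_hard_negative(sim) + kendall_ranking_loss(sim, pseudo)
```

Note that the naive all-pairs version above scales cubically with the batch size; the abstract's SW-HS strategy, which is not reproduced in this sketch, restricts the candidate pairs so that the overall complexity stays on the same order of magnitude as the triplet ranking loss while also filtering noisy pseudo labels.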
Related papers
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition
with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z) - Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and a 6.67% boost in R@50, underscoring its effectiveness.
arXiv Detail & Related papers (2023-06-03T11:50:44Z) - CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for
Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e. MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2022-08-21T08:37:50Z) - Joint Class-Affinity Loss Correction for Robust Medical Image
Segmentation with Noisy Labels [22.721870430220598]
Noisy labels prevent medical image segmentation algorithms from learning precise semantic correlations.
We present a novel perspective on noise mitigation that operates in both pixel-wise and pair-wise manners.
We propose a robust Joint Class-Affinity (JCAS) framework to combat label noise issues in medical image segmentation.
arXiv Detail & Related papers (2022-06-16T08:19:33Z) - Contrastive Semantic Similarity Learning for Image Captioning Evaluation
with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z) - Scene Graph Embeddings Using Relative Similarity Supervision [4.137464623395376]
We employ a graph convolutional network to exploit structure in scene graphs and produce image embeddings useful for semantic image retrieval.
We propose a novel loss function that operates on pairs of similar and dissimilar images and imposes relative ordering between them in embedding space.
We demonstrate that this Ranking loss, coupled with an intuitive triple sampling strategy, leads to robust representations that outperform well-known contrastive losses on the retrieval task.
arXiv Detail & Related papers (2021-04-06T09:13:05Z) - Reconstruction Regularized Deep Metric Learning for Multi-label Image
Classification [39.055689258395624]
We present a novel deep metric learning method to tackle the multi-label image classification problem.
Our model can be trained in an end-to-end manner.
arXiv Detail & Related papers (2020-07-27T13:28:50Z) - Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)