USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval
- URL: http://arxiv.org/abs/2301.06844v1
- Date: Tue, 17 Jan 2023 12:42:58 GMT
- Title: USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval
- Authors: Yan Zhang, Zhong Ji, Di Wang, Yanwei Pang, Xuelong Li
- Abstract summary: Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
- Score: 115.28586222748478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a fundamental and challenging task in bridging language and vision
domains, Image-Text Retrieval (ITR) aims at searching for the target instances
that are semantically relevant to the given query from the other modality, and
its key challenge is to measure the semantic similarity across different
modalities. Although significant progress has been achieved, existing
approaches typically suffer from two major limitations: (1) directly
exploiting the bottom-up attention based region-level features, in which each
region is treated equally, hurts the accuracy of the representation; (2) the
mini-batch based end-to-end training mechanism limits the scale of negative
sample pairs. To address these limitations, we propose a Unified Semantic
Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically,
we delicately design two simple but effective Global representation based
Semantic Enhancement (GSE) modules. One learns the global representation via
the self-attention algorithm, noted as Self-Guided Enhancement (SGE) module.
The other module benefits from the pre-trained CLIP module, which provides a
novel scheme to exploit and transfer the knowledge from an off-the-shelf model,
noted as CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the
training mechanism of MoCo into ITR, in which two dynamic queues are employed
to enrich and enlarge the scale of negative sample pairs. Meanwhile, a Unified
Training Objective (UTO) is developed to learn from mini-batch based and
dynamic queue based samples. Extensive experiments on the benchmark MSCOCO and
Flickr30K datasets demonstrate the superiority of the proposed method in both
retrieval accuracy and inference efficiency. Our source code will be released
at https://github.com/zhangy0822/USER.
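The queue-based mechanism the abstract describes can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' implementation: two dynamic queues of momentum-encoded keys (one per modality) enlarge the pool of negatives beyond the mini-batch, and an InfoNCE-style loss scores each query against both in-batch keys and queue keys (a rough stand-in for the Unified Training Objective). The SGE/CGE encoders are replaced by plain linear layers for brevity; all names, dimensions, and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F


class MomentumQueueContrast(torch.nn.Module):
    """Sketch of MoCo-style contrast for image-text retrieval with
    two dynamic queues of negatives (one per modality)."""

    def __init__(self, in_dim=512, dim=256, queue_size=4096,
                 momentum=0.999, temperature=0.07):
        super().__init__()
        self.m = momentum
        self.t = temperature
        # Query encoders (trained by gradient) and key encoders
        # (updated by momentum); linear stand-ins for the real encoders.
        self.img_q = torch.nn.Linear(in_dim, dim)
        self.txt_q = torch.nn.Linear(in_dim, dim)
        self.img_k = torch.nn.Linear(in_dim, dim)
        self.txt_k = torch.nn.Linear(in_dim, dim)
        for q, k in ((self.img_q, self.img_k), (self.txt_q, self.txt_k)):
            k.load_state_dict(q.state_dict())
            for p in k.parameters():
                p.requires_grad = False
        # Dynamic queues of past momentum-encoded keys.
        self.register_buffer("img_queue",
                             F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("txt_queue",
                             F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # key_params <- m * key_params + (1 - m) * query_params
        for q, k in ((self.img_q, self.img_k), (self.txt_q, self.txt_k)):
            for pq, pk in zip(q.parameters(), k.parameters()):
                pk.mul_(self.m).add_(pq, alpha=1 - self.m)

    @torch.no_grad()
    def _enqueue(self, img_keys, txt_keys):
        # Assumes queue_size is divisible by the batch size.
        n, p = img_keys.size(0), int(self.ptr)
        self.img_queue[p:p + n] = img_keys
        self.txt_queue[p:p + n] = txt_keys
        self.ptr[0] = (p + n) % self.img_queue.size(0)

    def forward(self, img_feat, txt_feat):
        iq = F.normalize(self.img_q(img_feat), dim=1)
        tq = F.normalize(self.txt_q(txt_feat), dim=1)
        with torch.no_grad():
            self._momentum_update()
            ik = F.normalize(self.img_k(img_feat), dim=1)
            tk = F.normalize(self.txt_k(txt_feat), dim=1)
        # Positives lie on the diagonal of the in-batch block; negatives
        # are the remaining in-batch keys plus the whole opposite queue.
        logits_i2t = torch.cat([iq @ tk.t(), iq @ self.txt_queue.t()], 1) / self.t
        logits_t2i = torch.cat([tq @ ik.t(), tq @ self.img_queue.t()], 1) / self.t
        labels = torch.arange(iq.size(0), device=iq.device)
        loss = (F.cross_entropy(logits_i2t, labels)
                + F.cross_entropy(logits_t2i, labels)) / 2
        self._enqueue(ik, tk)
        return loss
```

Because the queued keys come from the slowly moving momentum encoders, they stay consistent enough to serve as negatives across many steps, which is what lets the negative pool grow far beyond the mini-batch.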
Related papers
- Siamese Transformer Networks for Few-shot Image Classification [9.55588609556447]
Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples.
Existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both.
We propose a novel approach based on the Siamese Transformer Network (STN)
Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules.
arXiv Detail & Related papers (2024-07-16T14:27:23Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Self-Supervised Representation Learning with Meta Comprehensive Regularization [11.387994024747842]
We introduce a module called CompMod with Meta Comprehensive Regularization (MCR), embedded into existing self-supervised frameworks.
We update our proposed model through a bi-level optimization mechanism, enabling it to capture comprehensive features.
We provide theoretical support for our proposed method from information theory and causal counterfactual perspective.
arXiv Detail & Related papers (2024-03-03T15:53:48Z)
- Knowledge Transfer-Driven Few-Shot Class-Incremental Learning [23.163459923345556]
Few-shot class-incremental learning (FSCIL) aims to continually learn new classes using a few samples while not forgetting the old classes.
Despite the advance of existing FSCIL methods, the proposed knowledge transfer learning schemes are sub-optimal due to the insufficient optimization for the model's plasticity.
We propose a Random Episode Sampling and Augmentation (RESA) strategy that relies on diverse pseudo incremental tasks as agents to achieve the knowledge transfer.
arXiv Detail & Related papers (2023-06-19T14:02:45Z)
- Learning to Learn Better for Video Object Segmentation [94.5753973590207]
We propose a novel framework that emphasizes Learning to Learn Better (LLB) target features for SVOS.
We design the discriminative label generation module (DLGM) and the adaptive fusion module to address these issues.
Our proposed LLB method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-12-05T09:10:34Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We exploit to train a more effective cross-modal model which is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: Global Enhancement Module (GEM) and Local Enhancement Module (LEM)
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
- Adaptive Consistency Regularization for Semi-Supervised Transfer Learning [31.66745229673066]
We consider semi-supervised learning and transfer learning jointly, leading to a more practical and competitive paradigm.
To better exploit the value of both pre-trained weights and unlabeled target examples, we introduce adaptive consistency regularization.
Our proposed adaptive consistency regularization outperforms state-of-the-art semi-supervised learning techniques such as Pseudo Label, Mean Teacher, and MixMatch.
arXiv Detail & Related papers (2021-03-03T05:46:39Z)
- One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.