CLIP-based Synergistic Knowledge Transfer for Text-based Person
Retrieval
- URL: http://arxiv.org/abs/2309.09496v2
- Date: Tue, 2 Jan 2024 05:01:26 GMT
- Title: CLIP-based Synergistic Knowledge Transfer for Text-based Person
Retrieval
- Authors: Yating Liu, Yaowei Li, Zimo Liu, Wenming Yang, Yaowei Wang, Qingmin
Liao
- Abstract summary: We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module built from text-to-image and image-to-text bidirectional prompts and their coupling projections.
CSKT outperforms state-of-the-art approaches across three benchmark datasets while its trainable parameters account for merely 7.4% of the entire model.
- Score: 66.93563107820687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based Person Retrieval (TPR) aims to retrieve the target person images
given a textual query. The primary challenge lies in bridging the substantial
gap between vision and language modalities, especially when dealing with
limited large-scale datasets. In this paper, we introduce a CLIP-based
Synergistic Knowledge Transfer (CSKT) approach for TPR. Specifically, to
explore CLIP's knowledge on the input side, we first propose a Bidirectional
Prompts Transferring (BPT) module built from text-to-image and image-to-text
bidirectional prompts and their coupling projections. Secondly, Dual Adapters
Transferring (DAT) is designed to transfer knowledge on the output side of the
Multi-Head Attention (MHA) blocks in the vision and language branches. This
synergistic two-way collaborative mechanism promotes early-stage feature
fusion and efficiently exploits the existing knowledge of CLIP. CSKT
outperforms state-of-the-art approaches across three benchmark datasets while
its trainable parameters account for merely 7.4% of the entire model,
demonstrating its remarkable efficiency, effectiveness, and generalization.
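The abstract describes two lightweight trainable components attached to a frozen CLIP backbone: bidirectional prompts with coupling projections on the input side (BPT) and adapters on the output side of the MHA blocks (DAT). The paper's code is not reproduced here; the PyTorch sketch below only illustrates how such prompt-plus-adapter components are commonly wired into a frozen transformer layer. All module names, dimensions, prompt counts, and placement details are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (NOT the authors' code): BPT-style prompts with coupling
# projections and a DAT-style adapter on the MHA output, attached to a frozen
# CLIP-like attention block. Names and hyper-parameters are illustrative.
from typing import Optional

import torch
import torch.nn as nn


class CouplingProjection(nn.Module):
    """Projects prompts from one branch's width to the other's (assumed design)."""
    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, dst_dim)

    def forward(self, prompts: torch.Tensor) -> torch.Tensor:
        return self.proj(prompts)


class Adapter(nn.Module):
    """Bottleneck adapter applied to the MHA output (assumed DAT-style placement)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual bottleneck


class SynergisticLayer(nn.Module):
    """One frozen CLIP-style attention block with prompts and an adapter bolted on."""
    def __init__(self, dim: int, other_dim: int, num_heads: int, n_prompts: int = 8):
        super().__init__()
        # Frozen backbone pieces (in practice, weights loaded from CLIP and frozen).
        self.ln = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in list(self.ln.parameters()) + list(self.attn.parameters()):
            p.requires_grad = False
        # Trainable, parameter-efficient additions.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.coupling = CouplingProjection(dim, other_dim)  # to the other branch's width
        self.adapter = Adapter(dim)

    def forward(self, tokens: torch.Tensor,
                incoming_prompts: Optional[torch.Tensor] = None):
        b = tokens.size(0)
        own = self.prompts.unsqueeze(0).expand(b, -1, -1)
        extra = [own] if incoming_prompts is None else [own, incoming_prompts]
        x = torch.cat([tokens] + extra, dim=1)   # append prompt tokens to the sequence
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + self.adapter(attn_out)           # adapter acts on the MHA output
        # Project this branch's prompts to the other modality's width for the next layer.
        outgoing = self.coupling(own)
        return x[:, : tokens.size(1)], outgoing
```

In a full model, one such branch would process CLIP's image tokens and another its text tokens, exchanging the coupled prompts layer by layer; only the prompts, coupling projections, and adapters would be trained, which is how a small trainable-parameter budget (the abstract quotes 7.4%) can be achieved.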
Related papers
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text
Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z) - CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-to-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Vision-Language Pre-Training with Triple Contrastive Learning [45.80365827890119]
We propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision.
Ours is the first work to take local structure information into account for multi-modal representation learning (a rough sketch of the cross-modal plus intra-modal objective appears after this list).
arXiv Detail & Related papers (2022-02-21T17:54:57Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
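As a rough illustration of the cross-modal plus intra-modal self-supervision mentioned in the TCL entry above, the sketch below combines a standard symmetric InfoNCE loss between image and text embeddings with intra-modal InfoNCE terms over two augmented views per modality. The loss weighting and the two-view setup are illustrative assumptions, not the paper's actual recipe.

```python
# Rough sketch (assumed, not the TCL implementation): cross-modal InfoNCE
# alignment plus intra-modal terms computed over two augmented views.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings (matched by index)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def cross_plus_intra_loss(img_v1, img_v2, txt_v1, txt_v2, w_intra: float = 0.5):
    """Cross-modal term plus intra-modal terms; the 0.5 weight is an assumption."""
    cross = info_nce(img_v1, txt_v1)                 # align images with their captions
    intra_img = info_nce(img_v1, img_v2)             # two augmentations of the same image
    intra_txt = info_nce(txt_v1, txt_v2)             # two views of the same caption
    return cross + w_intra * (intra_img + intra_txt)
```

The intra-modal terms encourage each encoder to preserve structure within its own modality, while the cross-modal term aligns the two modalities.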