An Empirical Study of CLIP for Text-based Person Search
- URL: http://arxiv.org/abs/2308.10045v2
- Date: Thu, 21 Dec 2023 04:01:11 GMT
- Title: An Empirical Study of CLIP for Text-based Person Search
- Authors: Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
- Abstract summary: Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions.
Contrastive Language-Image Pre-training (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably on various cross-modal downstream tasks.
This paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS tasks.
- Score: 51.94743973155648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based Person Search (TBPS) aims to retrieve person images using
natural language descriptions. Recently, Contrastive Language-Image Pre-training
(CLIP), a universal large cross-modal vision-language pre-training model, has
performed remarkably on various cross-modal downstream tasks thanks to its
powerful cross-modal semantic learning capacity. TBPS, as a fine-grained
cross-modal retrieval task, has likewise seen a rise in CLIP-based research.
To explore the potential of the vision-language pre-training model for
downstream TBPS tasks, this paper makes the first attempt at a comprehensive
empirical study of CLIP for TBPS and thus contributes a straightforward,
incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit
critical design considerations under CLIP, including data augmentation and the
loss function. With these designs and practical training tricks, the model
attains satisfactory performance without any sophisticated modules. We also
conduct probing experiments on TBPS-CLIP in terms of model generalization and
model compression, demonstrating the effectiveness of TBPS-CLIP from various
aspects. This work is expected to provide empirical insights and highlight
directions for future CLIP-based TBPS research.
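For readers unfamiliar with the objective the paper builds on, the sketch below shows the symmetric image-text contrastive (InfoNCE) loss that CLIP is trained with and that CLIP-based TBPS baselines start from. The function name, embedding dimension, and fixed temperature are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of CLIP's symmetric image-text contrastive loss (assumed setup,
# not the TBPS-CLIP implementation).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot products below are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image in the batch matches the i-th caption.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for CLIP encoder outputs.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)   # batch of image embeddings
    caps = torch.randn(8, 512)   # batch of caption embeddings
    print(clip_contrastive_loss(imgs, caps).item())
```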
Related papers
- RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports [19.915033191502328]
Vision-language foundation models are increasingly investigated in computer vision and natural language processing, yet such models tailored to retinal images remain scarce.
To fill this gap, a CLIP-style retinal image foundation model is developed in this paper.
Our foundation model, RET-CLIP, is specifically trained on a dataset of 193,865 patients to extract general features of color fundus photographs.
arXiv Detail & Related papers (2024-05-23T03:20:51Z)
- CLIP Can Understand Depth [5.6138460823631835]
We adapt CLIP to monocular depth estimation with dense prediction, obtaining depth maps of meaningful quality.
Our model exhibits impressive performance matching several previous state-of-the-art vision-only models.
arXiv Detail & Related papers (2024-02-05T18:09:33Z)
- CLIP in Medical Imaging: A Comprehensive Survey [59.429714742927956]
Contrastive Language-Image Pre-training successfully introduces text supervision to vision models.
It has shown promising results across various tasks, attributable to its generalizability and interpretability.
CLIP has recently attracted increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval [66.93563107820687]
We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module built from text-to-image and image-to-text bidirectional prompts and coupling projections.
CSKT outperforms the state-of-the-art approaches across three benchmark datasets while its trainable parameters account for merely 7.4% of the entire model.
arXiv Detail & Related papers (2023-09-18T05:38:49Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)