Prototypical Contrastive Learning-based CLIP Fine-tuning for Object
Re-identification
- URL: http://arxiv.org/abs/2310.17218v1
- Date: Thu, 26 Oct 2023 08:12:53 GMT
- Title: Prototypical Contrastive Learning-based CLIP Fine-tuning for Object
Re-identification
- Authors: Jiachen Li and Xiaojin Gong
- Abstract summary: This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
- Score: 13.090873217313732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work aims to adapt large-scale pre-trained vision-language models, such
as contrastive language-image pretraining (CLIP), to enhance the performance of
object re-identification (Re-ID) across various supervision settings. Although
prompt learning has enabled a recent work named CLIP-ReID to achieve promising
performance, the underlying mechanisms and the necessity of prompt learning
remain unclear due to the absence of semantic labels in Re-ID tasks. In this
work, we first analyze the role of prompt learning in CLIP-ReID and identify its
limitations. Based on our investigations, we propose a simple yet effective
approach to adapt CLIP for supervised object Re-ID. Our approach directly
fine-tunes the image encoder of CLIP using a prototypical contrastive learning
(PCL) loss, eliminating the need for prompt learning. Experimental results on
both person and vehicle Re-ID datasets demonstrate the competitiveness of our
method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP
fine-tuning approach to unsupervised scenarios, where we achieve
state-of-the-art performance.
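Since the abstract names the objective but not its form, the snippet below is a minimal sketch of a prototypical contrastive learning (PCL) loss of the kind described, written in PyTorch. The function name, the temperature value, and the assumption that prototypes are L2-normalized per-identity feature centroids are illustrative choices, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def pcl_loss(features, labels, prototypes, temperature=0.05):
    """Prototypical contrastive loss over identity prototypes.

    features:   (B, D) image-encoder embeddings for a batch
    labels:     (B,)   identity indices in [0, K)
    prototypes: (K, D) one prototype per identity (e.g. a feature centroid)
    """
    features = F.normalize(features, dim=1)            # work in cosine-similarity space
    prototypes = F.normalize(prototypes, dim=1)
    logits = features @ prototypes.t() / temperature   # (B, K) scaled similarities
    # Each image is pulled toward its own identity's prototype and pushed
    # away from every other prototype via softmax cross-entropy.
    return F.cross_entropy(logits, labels)
```

In a supervised fine-tuning loop, the prototypes could be (momentum-updated) mean features per identity, and the gradient flows only into CLIP's image encoder, consistent with the abstract's claim that no prompt learning is needed.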
Related papers
- ET tu, CLIP? Addressing Common Object Errors for Unseen Environments [0.2714641498775158]
We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task.
In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective.
arXiv Detail & Related papers (2024-06-25T18:35:13Z)
- What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights [67.72413262980272]
Severe data imbalance naturally exists among web-scale vision-language datasets.
We find that CLIP pre-trained on such data exhibits notable robustness to the imbalance compared to supervised learning.
The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
arXiv Detail & Related papers (2024-05-31T17:57:24Z)
- CLIP Can Understand Depth [5.6138460823631835]
We adapt CLIP to produce meaningful monocular depth estimates with dense prediction.
Our model exhibits impressive performance matching several previous state-of-the-art vision-only models.
arXiv Detail & Related papers (2024-02-05T18:09:33Z)
- Data Poisoning for In-context Learning [49.77204165250528]
In-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks.
This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks.
We introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL.
arXiv Detail & Related papers (2024-02-03T14:20:20Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study the transferable representation learning underlying CLIP and demonstrate how features from different modalities become aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle the problem using instance-discrimination-based contrastive learning (CL).
In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models.
arXiv Detail & Related papers (2022-09-28T04:36:34Z)
- Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means (a sketch follows this list).
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
arXiv Detail & Related papers (2022-06-02T19:05:13Z)
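The Cluster Learnability protocol summarized in the entry just above is concrete enough to sketch. The snippet below is one illustrative reading of it using scikit-learn; the number of clusters, the neighbor count, and the 50/50 split are assumptions rather than the paper's settings.

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def cluster_learnability(representations, n_clusters=10, n_neighbors=5, seed=0):
    """Cluster the representations with K-means, then score how well a KNN
    trained on one half predicts the cluster labels of the held-out half."""
    pseudo_labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(representations)
    x_tr, x_te, y_tr, y_te = train_test_split(
        representations, pseudo_labels, test_size=0.5, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(x_tr, y_tr)
    return knn.score(x_te, y_te)  # higher accuracy = more learnable representations
```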
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.