Related papers: CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification

URL: http://arxiv.org/abs/2410.11255v1
Date: Tue, 15 Oct 2024 04:25:58 GMT
Title: CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification
Authors: Huazhong Zhao, Lei Qi, Xin Geng
Abstract summary: We propose a hard sample mining method called DFGS (Depth-First Graph Sampler) based on depth-first search. By leveraging the powerful cross-modal learning capabilities of CLIP, we aim to apply our DFGS method to extract challenging samples and form mini-batches with high discriminative difficulty. Our results demonstrate significant improvements over other methods, confirming the effectiveness of DFGS in providing challenging samples.
Score: 42.429118831928214
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in pre-trained vision-language models like CLIP have shown promise in person re-identification (ReID) applications. However, their performance in generalizable person re-identification tasks remains suboptimal. The large-scale and diverse image-text pairs used in CLIP's pre-training may lead to a lack or insufficiency of certain fine-grained features. In light of these challenges, we propose a hard sample mining method called DFGS (Depth-First Graph Sampler), based on depth-first search, designed to offer sufficiently challenging samples to enhance CLIP's ability to extract fine-grained features. DFGS can be applied to both the image encoder and the text encoder in CLIP. By leveraging the powerful cross-modal learning capabilities of CLIP, we aim to apply our DFGS method to extract challenging samples and form mini-batches with high discriminative difficulty, providing the image model with more efficient and challenging samples that are difficult to distinguish, thereby enhancing the model's ability to differentiate between individuals. Our results demonstrate significant improvements over other methods, confirming the effectiveness of DFGS in providing challenging samples that enhance CLIP's performance in generalizable person re-identification.
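The abstract's core idea, walking a similarity graph depth-first so that each mini-batch is filled with hard-to-distinguish identities, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the graph is built, so the k-nearest-neighbour graph over per-identity feature centroids, the cosine similarity measure, and the names `build_knn_graph` and `dfs_batch` are all assumptions made for illustration.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_knn_graph(centroids, k):
    """Assumed graph construction: link each identity to its k most
    similar other identities, ordered from most to least similar."""
    graph = {}
    for pid, feat in centroids.items():
        scored = [(cosine(feat, other), q)
                  for q, other in centroids.items() if q != pid]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[pid] = [q for _, q in scored[:k]]
    return graph


def dfs_batch(graph, start, num_ids):
    """Collect up to num_ids identities by iterative depth-first search,
    always descending into the most similar unvisited neighbour first."""
    visited, order, stack = set(), [], [start]
    while stack and len(order) < num_ids:
        pid = stack.pop()
        if pid in visited:
            continue
        visited.add(pid)
        order.append(pid)
        # Push in reverse so the most similar neighbour is popped next.
        for q in reversed(graph[pid]):
            if q not in visited:
                stack.append(q)
    return order


# Toy centroids (hypothetical data): identities 0-2 look alike, as do 3-4.
centroids = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.8, 0.3],
    3: [0.0, 1.0],
    4: [0.1, 0.9],
}
graph = build_knn_graph(centroids, k=2)
hard_ids = dfs_batch(graph, start=0, num_ids=3)
```

In this toy setup the traversal from identity 0 stays inside the cluster of mutually similar identities. A real sampler would then draw several images per selected identity to form the mini-batch; how the paper applies this to the image and text encoders is not specified in the abstract.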

Related papers

Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID [4.256800812615341]
Hard samples pose a significant challenge in person re-identification (ReID) tasks. Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. We propose a novel multimodal-guided Hard Sample Generation and Learning framework.
arXiv Detail & Related papers (2025-07-15T09:14:01Z)
Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning [43.12759195699103]
Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible processing of difficult samples. We propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce a model-adaptive difficulty definition that customizes curriculum datasets based on each model's individual capabilities rather than using predefined difficulty metrics. Second, we develop "Guided Prompting," which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance.
arXiv Detail & Related papers (2025-06-04T15:31:46Z)
Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection [23.48106270102081]
This paper tackles the challenge of detecting partially manipulated facial deepfakes. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters.
arXiv Detail & Related papers (2025-03-25T14:10:54Z)
Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation [21.20806568508201]
We show how to leverage class text information to mitigate distribution drifts encountered by vision-language models (VLMs) during test-time inference. We propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem. Experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT.
arXiv Detail & Related papers (2024-11-26T00:15:37Z)
Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification [2.552131151698595]
We propose SSSC-TransReID, a novel transformer-based person re-identification framework that combines self-supervision and supervision. We design a self-supervised contrastive learning branch, which can enhance the feature representation for person re-identification without negative samples or additional pre-training. Our proposed model consistently obtains superior Re-ID performance and outperforms state-of-the-art ReID methods by large margins on mean average precision (mAP) and Rank-1 accuracy.
arXiv Detail & Related papers (2024-10-21T03:17:25Z)
MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations [19.800907485589402]
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. These tuned models tend to become highly specialized, limiting their practicality for real-world deployment. We propose a lightweight representation calibration method for fine-tuning CLIP.
arXiv Detail & Related papers (2024-03-12T01:47:17Z)
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts. Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples. We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
Deep Boosting Multi-Modal Ensemble Face Recognition with Sample-Level Weighting [11.39204323420108]
Deep convolutional neural networks have achieved remarkable success in face recognition. The current training benchmarks exhibit an imbalanced quality distribution. This poses issues for generalization on hard samples since they are underrepresented during training. Inspired by the well-known AdaBoost, we propose a sample-level weighting approach to incorporate the importance of different samples into the FR loss.
arXiv Detail & Related papers (2023-08-18T01:44:54Z)
GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning [55.77244064907146]
One-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning. Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
arXiv Detail & Related papers (2023-03-16T12:06:02Z)
Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes. We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z)
Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
Feature Super-Resolution Based Facial Expression Recognition for Multi-scale Low-Resolution Faces [7.634398926381845]
Super-resolution methods are often used to enhance low-resolution images, but their performance on the FER task is limited for images of very low resolution. In this work, inspired by feature super-resolution methods for object detection, we propose a novel generative adversarial network-based super-resolution method for robust facial expression recognition.
arXiv Detail & Related papers (2020-04-05T15:38:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.