CLIP-Driven Fine-grained Text-Image Person Re-identification
- URL: http://arxiv.org/abs/2210.10276v1
- Date: Wed, 19 Oct 2022 03:43:12 GMT
- Title: CLIP-Driven Fine-grained Text-Image Person Re-identification
- Authors: Shuanglin Yan, Neng Dong, Liyan Zhang, Jinhui Tang
- Abstract summary: TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
- Score: 50.94827165464813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: TIReID aims to retrieve the image corresponding to the given text query from
a pool of candidate images. Existing methods employ prior knowledge from
single-modality pre-training to facilitate learning, but lack multi-modal
correspondences. Besides, due to the substantial gap between modalities,
existing methods embed the original modal features into the same latent space
for cross-modal alignment. However, feature embedding may lead to intra-modal
information distortion. Recently, CLIP has attracted extensive attention from
researchers due to its powerful semantic concept learning capacity and rich
multi-modal knowledge, which can help us solve the above problems. Accordingly,
in this paper, we propose a CLIP-driven Fine-grained information excavation
framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
To transfer the multi-modal knowledge effectively, we perform fine-grained
information excavation to mine intra-modal discriminative clues and inter-modal
correspondences. Specifically, we first design a multi-grained global feature
learning module to fully mine intra-modal discriminative local information,
which can emphasize identity-related discriminative clues by enhancing the
interactions between global image (text) and informative local patches (words).
Secondly, cross-grained feature refinement (CFR) and fine-grained
correspondence discovery (FCD) modules are proposed to establish the
cross-grained and fine-grained interactions between modalities, which can
filter out non-modality-shared image patches/words and mine cross-modal
correspondences from coarse to fine. CFR and FCD are removed during inference
to save computational costs. Note that the above process is performed in the
original modality space without further feature embedding. Extensive
experiments on multiple benchmarks demonstrate the superior performance of our
method on TIReID.
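As a rough illustration of the multi-grained idea described above, the sketch below selects the local patch (or word) tokens most related to the global CLIP token and lets the global feature attend to them. The module name, dimensions, top-k rule, and attention-based fusion are illustrative assumptions rather than the paper's exact design; the CFR and FCD modules are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalInteraction(nn.Module):
    """Illustrative sketch (not the paper's exact module): select the k local
    tokens most similar to the global token and let the global feature attend
    to them, emphasizing identity-related local clues."""

    def __init__(self, dim: int = 512, k: int = 16, heads: int = 8):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + N, D) -- CLIP-style sequence, global token first
        g, locals_ = tokens[:, :1], tokens[:, 1:]              # (B,1,D), (B,N,D)
        scores = F.cosine_similarity(locals_, g, dim=-1)       # (B, N)
        idx = scores.topk(self.k, dim=-1).indices              # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, locals_.size(-1))
        informative = locals_.gather(1, idx)                   # (B, k, D)
        # global feature queries only the informative local tokens
        enhanced, _ = self.attn(query=g, key=informative, value=informative)
        return self.norm(g + enhanced).squeeze(1)              # (B, D)

# usage: the same module can be applied to image patch tokens and text word tokens
x = torch.randn(4, 1 + 196, 512)        # e.g. CLIP ViT patch tokens + [CLS]
feat = GlobalLocalInteraction()(x)      # (4, 512) enhanced global feature
```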
Related papers
- Detached and Interactive Multimodal Learning [17.843121072628477]
This paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities.
It addresses modality competition by separately training each modality encoder with isolated learning objectives.
Experiments conducted on audio-visual, flow-image, and front-rear view datasets show the superior performance of our proposed method.
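As a loose illustration of "separately training each modality encoder with isolated learning objectives", the sketch below gives each encoder its own loss and optimizer so neither branch's gradients flow into the other. The encoder shapes, losses, and optimizers are assumptions, not DI-MML's actual design.

```python
import torch
import torch.nn as nn

# Illustrative sketch of "detached" per-modality training: each encoder gets
# its own isolated objective instead of a single fused loss.
audio_enc = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
visual_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
opt_a = torch.optim.Adam(audio_enc.parameters(), lr=1e-4)
opt_v = torch.optim.Adam(visual_enc.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

def detached_step(audio, image, labels):
    # audio branch: its loss never back-propagates into the visual branch
    loss_a = ce(audio_enc(audio), labels)
    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()
    # visual branch: trained with its own isolated objective
    loss_v = ce(visual_enc(image), labels)
    opt_v.zero_grad()
    loss_v.backward()
    opt_v.step()
    return loss_a.item(), loss_v.item()
```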
arXiv Detail & Related papers (2024-07-28T15:38:58Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
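A toy sketch of the anchor/target margin idea follows: one term uses a base margin while a second term enlarges it to push matched and unmatched pairs further apart. In DBL the two branches are separate networks trained cooperatively; here a single similarity matrix and the specific boosting rule are only assumptions.

```python
import torch
import torch.nn.functional as F

def boosted_margin_loss(img, txt, base_margin: float = 0.2, boost: float = 0.1):
    """Toy sketch: the anchor term uses a fixed margin, while the target term
    enlarges it so matched/unmatched pairs are pushed further apart."""
    sim = F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()   # (B, B)
    pos = sim.diag().unsqueeze(1)                                   # matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))
    hardest_neg = neg.max(dim=1, keepdim=True).values               # hardest unmatched pair
    anchor_loss = F.relu(base_margin + hardest_neg - pos).mean()
    target_loss = F.relu(base_margin + boost + hardest_neg - pos).mean()
    return anchor_loss, target_loss
```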
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- One-stage Modality Distillation for Incomplete Multimodal Learning [7.791488931628906]
This paper presents a one-stage modality distillation framework that unifies the privileged knowledge transfer and modality information fusion.
The proposed framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.
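The sketch below shows one way such a single-stage setup could look: a teacher branch sees the complete modalities (privileged information), the student sees the incomplete input, and the task loss and a feature-matching distillation loss are optimized together. All module sizes and loss choices are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed one-stage setup: teacher uses complete modalities, student only RGB.
teacher = nn.Linear(512 + 128, 256)      # privileged: RGB + depth features
student = nn.Linear(512, 256)            # deployed: RGB only
head = nn.Linear(256, 10)
params = list(teacher.parameters()) + list(student.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

def one_stage_step(rgb, depth, labels, alpha: float = 0.5):
    t_feat = teacher(torch.cat([rgb, depth], dim=-1))
    s_feat = student(rgb)
    # both branches solve the task, trained in a single stage
    task_loss = F.cross_entropy(head(s_feat), labels) + F.cross_entropy(head(t_feat), labels)
    # privileged knowledge transfer: student matches the teacher's fused features
    distill_loss = F.mse_loss(s_feat, t_feat.detach())
    loss = task_loss + alpha * distill_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```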
arXiv Detail & Related papers (2023-09-15T07:12:27Z)
- A Dimensional Structure based Knowledge Distillation Method for Cross-Modal Learning [15.544134849816528]
We discover the correlation between feature discriminability and dimensional structure (DS) by analyzing and observing features extracted from simple and hard tasks.
We propose a novel cross-modal knowledge distillation (CMKD) method for better supervised cross-modal learning (CML) performance.
The proposed method enforces output features to be channel-wise independent and intermediate ones to be uniformly distributed, thereby learning semantically irrelevant features from the hard task to boost its accuracy.
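Two generic losses that match this description are sketched below: an off-diagonal covariance penalty for channel-wise independence and a uniformity term (in the style of Wang & Isola) for intermediate features. Whether these are the paper's exact formulations is not confirmed by the summary.

```python
import torch
import torch.nn.functional as F

def channel_decorrelation(feats: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal entries of the channel covariance so output
    features become (approximately) channel-wise independent."""
    z = feats - feats.mean(dim=0, keepdim=True)          # (B, C), zero-mean
    cov = (z.t() @ z) / (feats.size(0) - 1)              # (C, C)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / feats.size(1)

def uniformity(feats: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Encourage intermediate features to spread uniformly on the unit sphere
    (an assumed stand-in for the paper's uniform-distribution constraint)."""
    z = F.normalize(feats, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```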
arXiv Detail & Related papers (2023-06-28T07:29:26Z)
- Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection [82.94413676131545]
We propose a novel knowledge-enhanced hierarchical information correlation learning approach (KhiCL) for multi-modal rumor detection.
KhiCL exploits a cross-modal joint dictionary to transfer heterogeneous unimodal features into a common feature space.
It extracts visual and textual entities from images and text, and designs a knowledge relevance reasoning strategy.
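One plausible reading of a "cross-modal joint dictionary" is sketched below: a shared set of learnable atoms onto which features from either modality are softly projected. The design and sizes are assumptions, not KhiCL's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDictionary(nn.Module):
    """Rough sketch (assumed design): a shared set of learnable atoms; features
    from either modality are re-expressed as soft combinations of the same
    atoms, which places heterogeneous features in a common space."""

    def __init__(self, num_atoms: int = 64, dim: int = 256):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, D) from either the image or the text encoder
        weights = F.softmax(feats @ self.atoms.t() / feats.size(-1) ** 0.5, dim=-1)
        return weights @ self.atoms                     # (B, D) in the common space

shared_dict = JointDictionary()
img_common = shared_dict(torch.randn(8, 256))
txt_common = shared_dict(torch.randn(8, 256))
```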
arXiv Detail & Related papers (2023-06-28T06:08:20Z)
- Factorized Contrastive Learning: Going Beyond Multi-view Redundancy [116.25342513407173]
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z)
- Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
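A generic sketch of multi-level (global plus implicit local) alignment follows: an InfoNCE loss on global features and a local term in which each word attends over image patches without explicitly generated parts. The losses and weighting are assumptions, not MANet's exact modules.

```python
import torch
import torch.nn.functional as F

def multi_level_alignment(img_global, txt_global, img_patches, txt_words, tau: float = 0.07):
    """Generic sketch of aligning modalities at two levels (assumed losses)."""
    # global level: symmetric InfoNCE over the batch
    ig = F.normalize(img_global, dim=-1)
    tg = F.normalize(txt_global, dim=-1)
    logits = ig @ tg.t() / tau
    targets = torch.arange(logits.size(0))
    global_loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    # local level: each word softly attends to the patches of its paired image
    ip = F.normalize(img_patches, dim=-1)                     # (B, P, D)
    tw = F.normalize(txt_words, dim=-1)                       # (B, W, D)
    attn = F.softmax(tw @ ip.transpose(1, 2) / tau, dim=-1)   # (B, W, P)
    attended = attn @ ip                                      # (B, W, D)
    local_loss = (1 - F.cosine_similarity(attended, tw, dim=-1)).mean()
    return global_loss + local_loss
```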
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- Vision-Language Pre-Training with Triple Contrastive Learning [45.80365827890119]
We propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision.
Ours is the first work that takes into account local structure information for multi-modality representation learning.
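The sketch below illustrates how cross-modal and intra-modal contrastive terms could be combined into a triple objective; the weighting and the use of augmented views as intra-modal positives are assumptions rather than TCL's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # symmetric InfoNCE: matched rows of a and b are positives, the rest negatives
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / tau
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def triple_contrastive(img, txt, img_aug, txt_aug):
    """Sketch of combining cross-modal and intra-modal self-supervision."""
    cross = info_nce(img, txt)            # image <-> text
    intra_img = info_nce(img, img_aug)    # image <-> augmented image
    intra_txt = info_nce(txt, txt_aug)    # text  <-> augmented text
    return cross + 0.5 * (intra_img + intra_txt)
```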
arXiv Detail & Related papers (2022-02-21T17:54:57Z)
- CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification [38.96033760300123]
We propose a cross-modality transformer-based method (CMTR) for the visible-infrared person re-identification task.
We design novel modality embeddings, which are fused with token embeddings to encode each modality's information.
Our proposed CMTR model significantly outperforms existing CNN-based methods.
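The core idea summarized above is fusing modality information into token embeddings; the sketch below adds a learnable per-modality vector to every patch token before a shared transformer encoder. Fusion by addition and all sizes are assumptions, not CMTR's actual configuration.

```python
import torch
import torch.nn as nn

class ModalityAwareTokens(nn.Module):
    """Sketch of the modality-embedding idea (sizes and fusion-by-addition are
    assumptions): a learnable vector per modality is fused with every token
    embedding so the shared encoder knows which spectrum a token came from."""

    def __init__(self, dim: int = 768, num_modalities: int = 2):
        super().__init__()
        self.modality_embed = nn.Parameter(torch.zeros(num_modalities, dim))
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

    def forward(self, tokens: torch.Tensor, modality: int) -> torch.Tensor:
        # tokens: (B, N, D) patch-token embeddings; modality: 0 = visible, 1 = infrared
        tokens = tokens + self.modality_embed[modality]
        return self.encoder(tokens)
```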
arXiv Detail & Related papers (2021-10-18T03:12:59Z)