Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification
- URL: http://arxiv.org/abs/2312.16797v1
- Date: Thu, 28 Dec 2023 03:00:19 GMT
- Title: Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification
- Authors: Yajing Zhai, Yawen Zeng, Zhiyong Huang, Zheng Qin, Xin Jin, Da Cao
- Abstract summary: We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
- Score: 18.01407937934588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained attribute descriptions can significantly supplement the valuable semantic information of person images, which is vital to the success of the person re-identification (ReID) task. However, current ReID algorithms typically fail to effectively leverage the rich contextual information available, primarily due to their reliance on simplistic and coarse utilization
of image attributes. Recent advances in artificial intelligence-generated
content have made it possible to automatically generate plentiful fine-grained
attribute descriptions and make full use of them. Therefore, this paper explores
the potential of using the generated multiple person attributes as prompts in
ReID tasks with off-the-shelf (large) models for more accurate retrieval
results. To this end, we present a new framework called Multi-Prompts ReID
(MP-ReID), based on prompt learning and language models, to fully exploit fine-grained attributes to assist the ReID task. Specifically, MP-ReID first learns to
hallucinate diverse, informative, and promptable sentences for describing the
query images. This procedure includes (i) explicit prompts describing which attributes a person has and (ii) implicit learnable prompts for adjusting/conditioning the criteria used for identity matching.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT
and VQA models. Moreover, an alignment module is designed to fuse multi-prompts
(i.e., explicit and implicit ones) progressively and mitigate the cross-modal
gap. Extensive experiments on the existing attribute-involved ReID datasets,
namely, Market1501 and DukeMTMC-reID, demonstrate the effectiveness and
rationality of the proposed MP-ReID solution.
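To make the multi-prompt idea concrete, below is a minimal PyTorch sketch of how explicit text prompts (e.g., sentences produced by ChatGPT or a VQA model) and implicit learnable prompts could be fused and aligned with image features. The module names, the cross-attention fusion, and the InfoNCE-style alignment loss are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the multi-prompt idea (illustrative; not the authors' exact design).
# `image_encoder` / `text_encoder` are assumed CLIP-style towers returning (B, D) features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiPromptReID(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=512, n_implicit=8):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Implicit prompts: learnable context vectors trained end-to-end.
        self.implicit_prompts = nn.Parameter(torch.randn(n_implicit, embed_dim) * 0.02)
        # Alignment module: cross-attention from the prompt sequence to the image feature.
        self.align = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, images, explicit_prompt_tokens):
        img_feat = self.image_encoder(images)                 # (B, D)
        exp_feat = self.text_encoder(explicit_prompt_tokens)  # (B, D), e.g. a ChatGPT/VQA sentence
        bsz = img_feat.size(0)
        # Stack the explicit prompt and the implicit prompts into one sequence per sample.
        prompts = torch.cat(
            [exp_feat.unsqueeze(1),
             self.implicit_prompts.unsqueeze(0).expand(bsz, -1, -1)], dim=1)
        # Fuse progressively by letting the prompts attend to the image feature.
        fused, _ = self.align(prompts, img_feat.unsqueeze(1), img_feat.unsqueeze(1))
        return F.normalize(img_feat, dim=-1), F.normalize(fused.mean(dim=1), dim=-1)

    @staticmethod
    def alignment_loss(img_side, prompt_side, temperature=0.07):
        # Symmetric InfoNCE-style loss to mitigate the cross-modal gap.
        logits = img_side @ prompt_side.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In this sketch the explicit prompt is a single sentence embedding per image; ensembling several generators, as the paper does, would simply add more rows to the prompt sequence.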
Related papers
- Learning ID-free Item Representation with Token Crossing for Multimodal Recommendation [26.737971605928358]
We propose an ID-free MultimOdal TOken Representation scheme named MOTOR.
We first employ product quantization to discretize each item's multimodal features into discrete token IDs.
We then interpret the token embeddings corresponding to these token IDs as implicit item features.
The resulting representations can replace the original ID embeddings and transform the original multimodal recommender into an ID-free system.
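As a rough illustration of the token-ID idea, the following PyTorch sketch product-quantizes an item's multimodal feature into discrete token IDs and pools the corresponding token embeddings as an ID-free item representation. The codebook sizes, randomly initialized codebooks, and mean pooling are assumptions, not MOTOR's exact recipe.

```python
# Illustrative sketch: product-quantize a multimodal item feature into token IDs, then use
# the pooled token embeddings in place of a conventional item-ID embedding.
import torch
import torch.nn as nn


class PQTokenizer(nn.Module):
    def __init__(self, feat_dim=256, n_subspaces=4, codebook_size=256):
        super().__init__()
        assert feat_dim % n_subspaces == 0
        self.n_subspaces = n_subspaces
        self.sub_dim = feat_dim // n_subspaces
        # One codebook per subspace; in practice these would be learned (e.g. k-means / VQ).
        self.codebooks = nn.Parameter(torch.randn(n_subspaces, codebook_size, self.sub_dim))
        # Shared token embedding table: these act as the implicit item features.
        self.token_emb = nn.Embedding(n_subspaces * codebook_size, feat_dim)

    def tokenize(self, feats):                                 # feats: (B, feat_dim)
        chunks = feats.view(feats.size(0), self.n_subspaces, self.sub_dim)
        ids = []
        for s in range(self.n_subspaces):
            # Nearest codeword in this subspace gives the discrete token ID.
            dist = torch.cdist(chunks[:, s], self.codebooks[s])  # (B, codebook_size)
            ids.append(dist.argmin(dim=-1) + s * self.codebooks.size(1))
        return torch.stack(ids, dim=1)                          # (B, n_subspaces)

    def forward(self, feats):
        token_ids = self.tokenize(feats)
        # Replace the usual item-ID embedding with the pooled token embeddings.
        return self.token_emb(token_ids).mean(dim=1)            # (B, feat_dim)
```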
arXiv Detail & Related papers (2024-10-25T03:06:10Z)
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
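A minimal sketch of the caption-guidance idea follows, assuming an off-the-shelf captioner and CLIP-style encoders passed in as callables; the particular loss combination is an assumption, not the CLIP-SCGI objective.

```python
# Sketch of caption-guided ReID training (simplified). `captioner`, `tokenizer`,
# `image_encoder`, `text_encoder`, and `id_head` are assumed callables/modules.
import torch
import torch.nn.functional as F


def caption_guided_step(images, person_ids, captioner, tokenizer,
                        image_encoder, text_encoder, id_head, temperature=0.07):
    # 1) Hallucinate pseudo captions for the person images (often done offline).
    captions = [captioner(img) for img in images]            # list of strings
    tokens = tokenizer(captions)                              # (B, L) token ids

    # 2) Encode both modalities.
    img_feat = F.normalize(image_encoder(images), dim=-1)     # (B, D)
    txt_feat = F.normalize(text_encoder(tokens), dim=-1)      # (B, D)

    # 3) Caption guidance: pull each image toward its own synthesized caption.
    logits = img_feat @ txt_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    guide_loss = F.cross_entropy(logits, labels)

    # 4) Standard ReID identity classification on the image branch.
    id_loss = F.cross_entropy(id_head(img_feat), person_ids)
    return id_loss + guide_loss
```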
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
It is a data generation framework that (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
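To make the attribute-manipulation idea concrete, here is a toy Python sketch that swaps one knowledge-grounded attribute value in a caption; the tiny in-memory KB and the string-replacement editing are illustrative stand-ins for the symbolic KBs and image editing pipeline described in the paper.

```python
# Toy sketch of knowledge-guided attribute swapping for augmentation.
# The attribute "KB" and caption are illustrative; the paired image would be
# regenerated/edited accordingly, which is out of scope here.
import random

ATTRIBUTE_KB = {
    "upper_color": ["red", "blue", "black", "white"],
    "bag": ["backpack", "handbag", "no bag"],
}

def swap_attribute(caption: str, attributes: dict):
    """Replace one attribute value with a different value of the same KB type."""
    attr = random.choice(list(attributes))
    old = attributes[attr]
    new = random.choice([v for v in ATTRIBUTE_KB[attr] if v != old])
    new_attrs = {**attributes, attr: new}
    return caption.replace(old, new), new_attrs

caption = "a person in a red jacket carrying a backpack"
attrs = {"upper_color": "red", "bag": "backpack"}
print(swap_attribute(caption, attrs))
```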
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- ASI++: Towards Distributionally Balanced End-to-End Generative Retrieval [29.65717446547002]
ASI++ is a novel fully end-to-end generative retrieval method.
It aims to simultaneously learn balanced ID assignments and improve retrieval performance.
arXiv Detail & Related papers (2024-05-23T07:54:57Z)
- Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
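The identifier-as-target idea can be sketched in a few lines: each gallery image gets a unique identifier string, and retrieval reduces to generating an identifier and looking it up. The `mllm_generate` callable and the numeric identifier format below are placeholders, not the paper's actual scheme.

```python
# Minimal sketch of generative cross-modal retrieval: images get unique identifier strings,
# a multimodal LM learns to emit them, and retrieval is generation plus a dictionary lookup.
def assign_identifiers(image_paths):
    # Simple numeric identifiers; the model memorizes the image behind each string.
    return {f"<img-{i:06d}>": path for i, path in enumerate(image_paths)}

def retrieve(query_text, mllm_generate, id2image):
    # Decoding is typically constrained to valid identifiers so the output
    # always maps to a real gallery image.
    generated_id = mllm_generate(query_text)
    return id2image.get(generated_id)      # None if an unknown identifier is emitted

id2image = assign_identifiers(["gallery/0001.jpg", "gallery/0002.jpg"])
result = retrieve("a man in a blue coat on a bicycle",
                  mllm_generate=lambda q: "<img-000001>",   # stand-in for the trained MLLM
                  id2image=id2image)
print(result)   # gallery/0002.jpg
```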
arXiv Detail & Related papers (2024-02-16T16:31:46Z)
- Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification [78.52704557647438]
We propose a novel FIne-grained Representation and Recomposition (FIRe^2) framework to tackle both limitations without any auxiliary annotation or data.
Experiments demonstrate that FIRe^2 can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
arXiv Detail & Related papers (2023-08-21T12:59:48Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
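Below is a minimal sketch of a retriever over joint text-and-image queries; the additive late fusion and the encoder interfaces are simplifying assumptions rather than the ReViz architecture.

```python
# Sketch of an end-to-end retriever over multimodal (text + image) queries.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalQueryRetriever(nn.Module):
    def __init__(self, text_encoder, image_encoder, dim=512):
        super().__init__()
        self.text_encoder = text_encoder      # maps token ids -> (B, dim)
        self.image_encoder = image_encoder    # maps images    -> (B, dim)
        self.proj = nn.Linear(dim, dim)

    def encode_query(self, text_tokens, images):
        # Simple additive late fusion of the two query modalities.
        q = self.text_encoder(text_tokens) + self.image_encoder(images)
        return F.normalize(self.proj(q), dim=-1)

    def retrieve(self, text_tokens, images, corpus_emb, k=5):
        # corpus_emb: (N, dim) pre-encoded, L2-normalized knowledge passages.
        q = self.encode_query(text_tokens, images)          # (B, dim)
        scores = q @ corpus_emb.t()                         # (B, N)
        return scores.topk(k, dim=-1).indices               # top-k passage indices per query
```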
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
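The following PyTorch sketch illustrates pre-training with both text and attribute supervision, loosely in the spirit of this setup; the specific heads, losses, and loss weighting are assumptions, not the VAL-PAT objectives.

```python
# Sketch of multimodal pre-training that combines image-text contrastive supervision
# with multi-label attribute recognition; encoders and the attribute head are assumed modules.
import torch
import torch.nn.functional as F


def pretrain_step(images, text_tokens, attr_labels,
                  image_encoder, text_encoder, attr_head,
                  temperature=0.07, attr_weight=1.0):
    img_raw = image_encoder(images)                          # (B, D)
    img = F.normalize(img_raw, dim=-1)
    txt = F.normalize(text_encoder(text_tokens), dim=-1)     # (B, D)

    # Image-text contrastive term (text annotations).
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Multi-label attribute term (attribute annotations, e.g. gender, bag, clothing color).
    attr_logits = attr_head(img_raw)                          # (B, n_attributes)
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())

    return itc + attr_weight * attr_loss
```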
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.