Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based
Person Re-Identification
- URL: http://arxiv.org/abs/2312.16797v1
- Date: Thu, 28 Dec 2023 03:00:19 GMT
- Title: Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based
Person Re-Identification
- Authors: Yajing Zhai, Yawen Zeng, Zhiyong Huang, Zheng Qin, Xin Jin, Da Cao
- Abstract summary: We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
- Score: 18.01407937934588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained attribute descriptions can significantly supplement the
valuable semantic information of person images, which is vital to the success
of the person re-identification (ReID) task. However, current ReID algorithms
typically fail to effectively leverage the rich contextual information
available, primarily because they rely on simplistic and coarse utilization
of image attributes. Recent advances in AI-generated content have made it
possible to automatically generate plentiful fine-grained attribute
descriptions and to make full use of them. This paper therefore explores the
potential of using multiple generated person attributes as prompts in ReID
tasks, together with off-the-shelf (large) models, for more accurate retrieval
results. To this end, we present a new framework called Multi-Prompts ReID
(MP-ReID), based on prompt learning and language models, that fully exploits
fine-grained attributes to assist the ReID task. Specifically, MP-ReID first
learns to hallucinate diverse, informative, and promptable sentences describing
the query images. This procedure includes (i) explicit prompts stating which
attributes a person has, and (ii) implicit learnable prompts for
adjusting/conditioning the criteria used for identity matching.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT
and VQA models. Moreover, an alignment module is designed to fuse the multiple
prompts (i.e., explicit and implicit ones) progressively and mitigate the
cross-modal gap. Extensive experiments on existing attribute-involved ReID
datasets, namely Market1501 and DukeMTMC-reID, demonstrate the effectiveness
and rationality of the proposed MP-ReID solution.
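The multi-prompt idea described in the abstract can be illustrated with a minimal NumPy sketch (all names, dimensions, and the toy text encoder are hypothetical stand-ins, not the actual MP-ReID implementation): explicit attribute sentences are embedded, concatenated with implicit learnable context vectors, pooled into one text-side query, and matched against image embeddings by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding size (hypothetical)

def embed_text(sentence: str) -> np.ndarray:
    """Toy stand-in for a frozen text encoder: hash words into a fixed vector."""
    vec = np.zeros(DIM)
    for word in sentence.lower().split():
        vec[hash(word) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

# (i) Explicit prompts: attribute sentences, e.g. produced by ensembling
# ChatGPT and VQA models as the paper describes.
explicit_prompts = [
    "a person wearing a red jacket",
    "a person carrying a backpack",
]
explicit = np.stack([embed_text(p) for p in explicit_prompts])

# (ii) Implicit prompts: learnable context vectors (randomly initialized here;
# in the real framework they would be optimized for identity matching).
n_ctx = 4
implicit = rng.normal(scale=0.1, size=(n_ctx, DIM))

# Stand-in for the alignment module: fuse explicit and implicit prompts by
# mean-pooling into a single normalized text-side query embedding.
query = np.concatenate([explicit, implicit]).mean(axis=0)
query /= np.linalg.norm(query) + 1e-8

# Gallery of image embeddings (hypothetical output of a visual encoder).
gallery = rng.normal(size=(5, DIM))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Retrieval: rank gallery images by cosine similarity to the fused query.
scores = gallery @ query
ranking = np.argsort(-scores)
print(ranking)
```

The real framework fuses the two prompt types progressively through a learned alignment module rather than simple mean-pooling; the sketch only shows the data flow from prompts to a ranked gallery.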
Related papers
- Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning [29.19130646630545]
We introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network, designed to effectively utilize both invariant and specific information for identification.
The proposed MIP outperforms most state-of-the-art methods.
arXiv Detail & Related papers (2024-06-18T06:39:03Z)
- Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
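As a rough illustration of the generative-retrieval idea summarized above (identifier strings as retrieval targets, with an MLLM that "memorizes" the mapping), here is a toy sketch; the keyword-lookup "model" and all names are hypothetical stand-ins for the actual trained model:

```python
# Toy sketch of generative cross-modal retrieval: each image is assigned a
# unique identifier string; retrieval amounts to generating the identifier
# for a query and looking up the image it denotes.
image_to_id = {
    "img_beach.jpg": "id-0417",
    "img_city.jpg": "id-0932",
}
# Invert the mapping: generated identifier -> image.
id_to_image = {v: k for k, v in image_to_id.items()}

def generate_identifier(query: str) -> str:
    """Stand-in for an MLLM decoding an identifier string token by token.
    Faked here with a keyword lookup."""
    if "beach" in query:
        return "id-0417"
    return "id-0932"

def retrieve(query: str) -> str:
    return id_to_image[generate_identifier(query)]

print(retrieve("a sunny beach scene"))  # -> img_beach.jpg
```

Unlike discriminative retrieval, no similarity scores over the gallery are computed at query time; the generated string itself is the retrieval result.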
arXiv Detail & Related papers (2024-02-16T16:31:46Z)
- Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis [6.215536001787723]
Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models.
We propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content.
KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.
arXiv Detail & Related papers (2023-11-29T18:51:46Z)
- ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation [13.338363107777438]
We propose a novel recommendation model by incorporating ID embeddings to enhance the salient features of both content and structure.
Our method outperforms state-of-the-art multimodal recommendation methods and demonstrates the effectiveness of fine-grained ID embeddings.
arXiv Detail & Related papers (2023-11-10T09:41:28Z)
- Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification [78.52704557647438]
We propose a novel FIne-grained Representation and Recomposition (FIRe²) framework to tackle both limitations without any auxiliary annotation or data.
Experiments demonstrate that FIRe² can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
arXiv Detail & Related papers (2023-08-21T12:59:48Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, along with an additional gender-prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all listed content) and is not responsible for any consequences of its use.