Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based
Person Re-Identification
- URL: http://arxiv.org/abs/2312.16797v1
- Date: Thu, 28 Dec 2023 03:00:19 GMT
- Title: Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based
Person Re-Identification
- Authors: Yajing Zhai, Yawen Zeng, Zhiyong Huang, Zheng Qin, Xin Jin, Da Cao
- Abstract summary: We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
- Score: 18.01407937934588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained attribute descriptions can significantly supplement the
valuable semantic information of person images, which is vital to the success
of the person re-identification (ReID) task. However, current ReID algorithms
typically fail to effectively leverage the rich contextual information
available, primarily because they rely on simplistic and coarse utilization
of image attributes. Recent advances in AI-generated content have made it
possible to automatically generate plentiful fine-grained attribute
descriptions and to make full use of them. This paper therefore explores the
potential of using multiple generated person attributes as prompts in ReID
tasks, together with off-the-shelf (large) models, for more accurate retrieval
results. To this end, we present a new framework called Multi-Prompts ReID
(MP-ReID), based on prompt learning and language models, that fully exploits
fine-grained attributes to assist the ReID task. Specifically, MP-ReID first
learns to hallucinate diverse, informative, and promptable sentences describing
the query images. This procedure includes (i) explicit prompts stating which
attributes a person has, and (ii) implicit learnable prompts for
adjusting/conditioning the criteria used for identity matching.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT
and VQA models. Moreover, an alignment module is designed to fuse the multiple
prompts (i.e., explicit and implicit ones) progressively and mitigate the
cross-modal gap. Extensive experiments on existing attribute-involved ReID
datasets, namely Market1501 and DukeMTMC-reID, demonstrate the effectiveness
and rationality of the proposed MP-ReID solution.
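The multi-prompt idea described in the abstract can be illustrated with a minimal NumPy sketch (all names, dimensions, and the toy text encoder are hypothetical stand-ins, not the actual MP-ReID implementation): explicit attribute sentences are embedded, concatenated with implicit learnable context vectors, pooled into one text-side query, and matched against image embeddings by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding size (hypothetical)

def embed_text(sentence: str) -> np.ndarray:
    """Toy stand-in for a frozen text encoder: hash words into a fixed vector."""
    vec = np.zeros(DIM)
    for word in sentence.lower().split():
        vec[hash(word) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

# (i) Explicit prompts: attribute sentences, e.g. produced by ensembling
# ChatGPT and VQA models as the paper describes.
explicit_prompts = [
    "a person wearing a red jacket",
    "a person carrying a backpack",
]
explicit = np.stack([embed_text(p) for p in explicit_prompts])

# (ii) Implicit prompts: learnable context vectors (randomly initialized here;
# in the real framework they would be optimized for identity matching).
n_ctx = 4
implicit = rng.normal(scale=0.1, size=(n_ctx, DIM))

# Stand-in for the alignment module: fuse explicit and implicit prompts by
# mean-pooling into a single normalized text-side query embedding.
query = np.concatenate([explicit, implicit]).mean(axis=0)
query /= np.linalg.norm(query) + 1e-8

# Gallery of image embeddings (hypothetical output of a visual encoder).
gallery = rng.normal(size=(5, DIM))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Retrieval: rank gallery images by cosine similarity to the fused query.
scores = gallery @ query
ranking = np.argsort(-scores)
print(ranking)
```

The real framework fuses the two prompt types progressively through a learned alignment module rather than simple mean-pooling; the sketch only shows the data flow from prompts to a ranked gallery.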
Related papers
- Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning [29.19130646630545]
We introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network, designed to effectively utilize both invariant and specific information for identification.
The proposed MIP outperforms most state-of-the-art methods.
arXiv Detail & Related papers (2024-06-18T06:39:03Z)
- Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
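As a rough illustration of the generative-retrieval idea summarized above (identifier strings as retrieval targets, with an MLLM that "memorizes" the mapping), here is a toy sketch; the keyword-lookup "model" and all names are hypothetical stand-ins for the actual trained model:

```python
# Toy sketch of generative cross-modal retrieval: each image is assigned a
# unique identifier string; retrieval amounts to generating the identifier
# for a query and looking up the image it denotes.
image_to_id = {
    "img_beach.jpg": "id-0417",
    "img_city.jpg": "id-0932",
}
# Invert the mapping: generated identifier -> image.
id_to_image = {v: k for k, v in image_to_id.items()}

def generate_identifier(query: str) -> str:
    """Stand-in for an MLLM decoding an identifier string token by token.
    Faked here with a keyword lookup."""
    if "beach" in query:
        return "id-0417"
    return "id-0932"

def retrieve(query: str) -> str:
    return id_to_image[generate_identifier(query)]

print(retrieve("a sunny beach scene"))  # -> img_beach.jpg
```

Unlike discriminative retrieval, no similarity scores over the gallery are computed at query time; the generated string itself is the retrieval result.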
arXiv Detail & Related papers (2024-02-16T16:31:46Z)
- Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis [6.215536001787723]
Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models.
We propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content.
KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.
arXiv Detail & Related papers (2023-11-29T18:51:46Z)
- ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation [13.338363107777438]
We propose a novel recommendation model by incorporating ID embeddings to enhance the salient features of both content and structure.
Our method outperforms state-of-the-art multimodal recommendation methods and demonstrates the effectiveness of fine-grained ID embeddings.
arXiv Detail & Related papers (2023-11-10T09:41:28Z)
- Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification [78.52704557647438]
We propose a novel FIne-grained Representation and Recomposition (FIRe²) framework to tackle both limitations without any auxiliary annotation or data.
Experiments demonstrate that FIRe² can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
arXiv Detail & Related papers (2023-08-21T12:59:48Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, along with an additional gender-prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all listed content) and is not responsible for any consequences of its use.