Towards Unified Text-based Person Retrieval: A Large-scale
Multi-Attribute and Language Search Benchmark
- URL: http://arxiv.org/abs/2306.02898v4
- Date: Mon, 14 Aug 2023 07:37:27 GMT
- Title: Towards Unified Text-based Person Retrieval: A Large-scale
Multi-Attribute and Language Search Benchmark
- Authors: Shuyu Yang, Yinan Zhou, Yaxiong Wang, Yujiao Wu, Li Zhu, Zhedong Zheng
- Abstract summary: We introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS.
Considering the privacy concerns and annotation costs, we leverage the off-the-shelf diffusion models to generate the dataset.
To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning framework.
- Score: 24.366997699462075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a large Multi-Attribute and Language Search
dataset for text-based person retrieval, called MALS, and explore the
feasibility of performing pre-training on attribute recognition and
image-text matching jointly within a single framework. In particular, MALS
contains 1,510,330 image-text pairs, about 37.5 times more than the prevailing
CUHK-PEDES dataset, and all images are annotated with 27 attributes. Considering the privacy
concerns and annotation costs, we leverage the off-the-shelf diffusion models
to generate the dataset. To verify the feasibility of learning from the
generated data, we develop a new joint Attribute Prompt Learning and Text
Matching Learning (APTM) framework that exploits the shared knowledge between
attributes and text. As the name implies, APTM contains an attribute prompt
learning stream and a text matching learning stream. (1) The attribute prompt
learning leverages the attribute prompts for image-attribute alignment, which
enhances the text matching learning. (2) The text matching learning facilitates
the representation learning on fine-grained details, and in turn, boosts the
attribute prompt learning. Extensive experiments validate the effectiveness of
the pre-training on MALS, achieving state-of-the-art retrieval performance via
APTM on three challenging real-world benchmarks. In particular, APTM achieves
consistent Recall@1 improvements of +6.96%, +7.68%, and +16.95% on the
CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
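The abstract describes the two APTM streams only at a high level. Below is a minimal, hedged PyTorch sketch of how an image-text matching loss and an attribute-prompt alignment loss could be combined; the embedding shapes, the 0.07 temperature, the prompt template, and the weight `alpha` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def aptm_style_loss(img_emb, cap_emb, attr_prompt_emb, attr_labels, alpha=1.0):
    """
    img_emb:         (B, D) image embeddings
    cap_emb:         (B, D) caption embeddings           -> text matching stream
    attr_prompt_emb: (A, D) embeddings of attribute prompts, e.g.
                     "a person wearing a backpack" (assumed template)
    attr_labels:     (B, A) binary attribute annotations -> attribute prompt stream
    """
    itm = contrastive_loss(img_emb, cap_emb)
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(attr_prompt_emb, dim=-1).t()
    apl = F.binary_cross_entropy_with_logits(sim / 0.07, attr_labels.float())
    return itm + alpha * apl

# Toy usage with random tensors standing in for real encoder outputs.
B, D, A = 8, 256, 27
loss = aptm_style_loss(torch.randn(B, D), torch.randn(B, D),
                       torch.randn(A, D), torch.randint(0, 2, (B, A)))
```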
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method built on knowledge-guided manipulation of visual attributes.
It extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents MARS (Mae-Attribute-Relation-Sensitive), a novel text-based person search (TBPS) architecture.
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
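The summary names only the two added terms. A hedged sketch of how a masked-reconstruction (MAE-style) loss and an attribute classification loss might be added on top of a base matching loss is shown below; the loss weights and tensor shapes are assumptions, not details from the paper.

```python
import torch.nn.functional as F

def mars_style_total_loss(match_loss, pred_patches, target_patches, mask,
                          attr_logits, attr_labels, w_rec=1.0, w_attr=1.0):
    """
    match_loss:     base image-text matching loss (scalar tensor)
    pred_patches:   (B, N, P) patch pixels reconstructed by an MAE-style decoder
    target_patches: (B, N, P) original patch pixels
    mask:           (B, N) 1 for masked patches; reconstruction is scored there only
    attr_logits:    (B, A) attribute predictions; attr_labels: (B, A) binary labels
    """
    rec = ((pred_patches - target_patches) ** 2).mean(dim=-1)   # per-patch MSE
    rec_loss = (rec * mask).sum() / mask.sum().clamp(min=1)     # masked patches only
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())
    return match_loss + w_rec * rec_loss + w_attr * attr_loss
```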
arXiv Detail & Related papers (2024-07-05T06:44:43Z) - AKGNet: Attribute Knowledge-Guided Unsupervised Lung-Infected Area Segmentation [25.874281336821685]
Lung-infected area segmentation is crucial for assessing the severity of lung diseases.
We propose a novel attribute knowledge-guided framework for unsupervised lung-infected area segmentation.
AKGNet facilitates text attribute knowledge learning, attribute-image cross-attention fusion, and high-confidence-based pseudo-label exploration.
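As an illustration of the high-confidence pseudo-label step, the sketch below thresholds per-pixel softmax confidence to build pseudo-labels; the 0.9 threshold and the ignore index are assumed values, not taken from the paper.

```python
import torch.nn.functional as F

def high_confidence_pseudo_labels(logits, threshold=0.9, ignore_index=255):
    """
    logits: (B, C, H, W) per-pixel class scores from the current model.
    Pixels whose top softmax probability falls below `threshold` are set to
    `ignore_index` so they are skipped by the segmentation loss.
    """
    probs = F.softmax(logits, dim=1)
    confidence, labels = probs.max(dim=1)          # both (B, H, W)
    labels = labels.clone()
    labels[confidence < threshold] = ignore_index
    return labels
```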
arXiv Detail & Related papers (2024-04-17T02:36:02Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Learning Transferable Pedestrian Representation from Multimodal
Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z) - Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image
Generation [18.36261166580862]
Text-to-image generation aims to generate photo-realistic and semantically consistent images according to the given text descriptions.
Existing methods mainly extract text information from a single sentence to represent an image.
We propose an effective text representation method that complements the sentence with attribute information.
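One plausible reading of an attribute-complemented text representation is an attention lookup over a bank of attribute embeddings; the sketch below illustrates that idea and is an assumption, not the paper's actual memory module.

```python
import torch
import torch.nn.functional as F

def attribute_augmented_text(sentence_emb, attribute_memory):
    """
    sentence_emb:     (B, D) embedding of the input description
    attribute_memory: (M, D) bank of learned attribute embeddings
    Attends over the memory and concatenates the retrieved attribute context
    to the sentence embedding, giving an attribute-complemented representation.
    """
    attn = F.softmax(sentence_emb @ attribute_memory.t(), dim=-1)   # (B, M)
    attr_context = attn @ attribute_memory                          # (B, D)
    return torch.cat([sentence_emb, attr_context], dim=-1)          # (B, 2D)
```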
arXiv Detail & Related papers (2022-09-28T12:28:54Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-driven referring image segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
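Text-to-pixel alignment can be illustrated as scoring every pixel embedding against the sentence embedding and supervising the resulting map with the ground-truth mask; the sketch below follows that reading, with the temperature and the BCE supervision being assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_emb, gt_mask, temperature=0.07):
    """
    pixel_feats: (B, D, H, W) per-pixel embeddings from the visual decoder
    text_emb:    (B, D) embedding of the referring expression
    gt_mask:     (B, H, W) binary mask of the referred object
    Scores every pixel against the sentence embedding and supervises the map
    with the ground-truth mask, pulling matched pixels towards the text.
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = torch.einsum('bdhw,bd->bhw', pixel_feats, text_emb) / temperature
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())
```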
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z) - ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural
Language [36.319953919737245]
Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions.
We propose an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions.
The resulting robust feature learning yields both strong results and a clear performance boost.
arXiv Detail & Related papers (2020-05-15T02:22:28Z)
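Grounding attribute phrases to visual regions can be sketched as a soft attention from each phrase to region features; the snippet below is an illustrative sketch under assumed shapes, not the ViTAA training objective itself.

```python
import torch.nn.functional as F

def ground_attribute_phrases(region_feats, phrase_embs, temperature=0.07):
    """
    region_feats: (R, D) features of visual regions (e.g. body parts) of one image
    phrase_embs:  (P, D) embeddings of attribute phrases parsed from the caption
    Softly grounds each phrase to its most similar regions and returns a
    per-phrase alignment score that could feed an alignment loss.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    phrase_embs = F.normalize(phrase_embs, dim=-1)
    sim = phrase_embs @ region_feats.t() / temperature          # (P, R)
    attn = F.softmax(sim, dim=-1)                               # phrase-to-region weights
    grounded = attn @ region_feats                               # (P, D)
    return (F.normalize(grounded, dim=-1) * phrase_embs).sum(dim=-1)  # (P,)
```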