Unified Pre-training with Pseudo Texts for Text-To-Image Person
Re-identification
- URL: http://arxiv.org/abs/2309.01420v1
- Date: Mon, 4 Sep 2023 08:11:36 GMT
- Title: Unified Pre-training with Pseudo Texts for Text-To-Image Person
Re-identification
- Authors: Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang
- Abstract summary: The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task.
There are two underlying inconsistencies between these two tasks that may impact the performance.
We present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task.
- Score: 42.791647210424664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pre-training task is indispensable for the text-to-image person
re-identification (T2I-ReID) task. However, there are two underlying
inconsistencies between these two tasks that may impact the performance: i)
Data inconsistency. A large domain gap exists between the generic images/texts
used in public pre-trained models and the specific person data in the T2I-ReID
task. This gap is especially severe for texts, as general textual data are
usually unable to describe specific people in fine-grained detail. ii) Training
inconsistency. The processes of pre-training of images and texts are
independent, despite cross-modality learning being critical to T2I-ReID. To
address the above issues, we present a new unified pre-training pipeline
(UniPT) designed specifically for the T2I-ReID task. We first build a
large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual
descriptions of images are automatically generated by the CLIP paradigm using a
divide-conquer-combine strategy. Benefiting from this dataset, we then utilize
a simple vision-and-language pre-training framework to explicitly align the
feature space of the image and text modalities during pre-training. In this
way, the pre-training task and the T2I-ReID task are made consistent with each
other on both data and training levels. Without the need for any bells and
whistles, our UniPT achieves competitive Rank-1 accuracies of 68.50%,
60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Both
the LUPerson-T dataset and the code are available at
https://github.com/ZhiyinShao-H/UniPT.
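
The abstract describes two steps that lend themselves to a brief illustration: (i) generating a pseudo caption for an unlabeled person image by scoring candidate attribute phrases with CLIP-style image-text similarity (the divide-conquer-combine idea), and (ii) explicitly aligning the image and text feature spaces during pre-training. The sketch below is only an illustration of these two ideas, not the authors' released code: the names pseudo_caption, alignment_loss, encode_text, and attribute_groups are assumptions standing in for whatever the paper actually uses, and the symmetric contrastive objective is a standard stand-in for the paper's alignment loss.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def pseudo_caption(image_emb, attribute_groups, encode_text):
    """Divide-conquer-combine style pseudo-text generation (illustrative only).

    image_emb        -- (D,) L2-normalized embedding of one person image.
    attribute_groups -- dict mapping an attribute group (e.g. "upper body") to a
                        list of candidate phrases describing that attribute.
    encode_text      -- callable returning (N, D) L2-normalized phrase embeddings.
    """
    chosen = []
    for phrases in attribute_groups.values():            # divide: one attribute group at a time
        scores = encode_text(phrases) @ image_emb         # conquer: CLIP-style image-phrase similarity
        chosen.append(phrases[int(scores.argmax())])      # keep the best-matching phrase per group
    return ", ".join(chosen)                              # combine: merge phrases into one description


def alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over a batch of matched
    pairs -- a standard way to explicitly align the two feature spaces."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under these assumptions, the generated captions would populate a dataset like LUPerson-T, and the alignment loss would be minimized over image-caption batches so that the pre-trained encoders already share a feature space before fine-tuning on CUHK-PEDES, ICFG-PEDES, or RSTPReid.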
Related papers
- T2I-ConBench: Text-to-Image Benchmark for Continual Post-training [25.90279125119419]
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models.
We introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models.
It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment.
arXiv Detail & Related papers (2025-05-22T16:31:43Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.