Unified Pre-training with Pseudo Texts for Text-To-Image Person
Re-identification
- URL: http://arxiv.org/abs/2309.01420v1
- Date: Mon, 4 Sep 2023 08:11:36 GMT
- Title: Unified Pre-training with Pseudo Texts for Text-To-Image Person
Re-identification
- Authors: Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang
- Abstract summary: The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task.
There are two underlying inconsistencies between these two tasks that may impact the performance.
We present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task.
- Score: 42.791647210424664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pre-training task is indispensable for the text-to-image person
re-identification (T2I-ReID) task. However, there are two underlying
inconsistencies between these two tasks that may impact the performance: i)
Data inconsistency. A large domain gap exists between the generic images/texts
used in public pre-trained models and the specific person data in the T2I-ReID
task. This gap is especially severe for texts, as general textual data are
usually unable to describe specific people in fine-grained detail. ii) Training
inconsistency. The processes of pre-training of images and texts are
independent, despite cross-modality learning being critical to T2I-ReID. To
address the above issues, we present a new unified pre-training pipeline
(UniPT) designed specifically for the T2I-ReID task. We first build a
large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual
descriptions of images are automatically generated by the CLIP paradigm using a
divide-conquer-combine strategy. Benefiting from this dataset, we then utilize
a simple vision-and-language pre-training framework to explicitly align the
feature space of the image and text modalities during pre-training. In this
way, the pre-training task and the T2I-ReID task are made consistent with each
other on both data and training levels. Without the need for any bells and
whistles, our UniPT achieves competitive Rank-1 accuracies of 68.50%,
60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Both
the LUPerson-T dataset and the code are available at
https://github.com/ZhiyinShao-H/UniPT.
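
The abstract describes two steps that lend themselves to a brief illustration: (i) generating a pseudo caption for an unlabeled person image by scoring candidate attribute phrases with CLIP-style image-text similarity (the divide-conquer-combine idea), and (ii) explicitly aligning the image and text feature spaces during pre-training. The sketch below is only an illustration of these two ideas, not the authors' released code: the names pseudo_caption, alignment_loss, encode_text, and attribute_groups are assumptions standing in for whatever the paper actually uses, and the symmetric contrastive objective is a standard stand-in for the paper's alignment loss.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def pseudo_caption(image_emb, attribute_groups, encode_text):
    """Divide-conquer-combine style pseudo-text generation (illustrative only).

    image_emb        -- (D,) L2-normalized embedding of one person image.
    attribute_groups -- dict mapping an attribute group (e.g. "upper body") to a
                        list of candidate phrases describing that attribute.
    encode_text      -- callable returning (N, D) L2-normalized phrase embeddings.
    """
    chosen = []
    for phrases in attribute_groups.values():            # divide: one attribute group at a time
        scores = encode_text(phrases) @ image_emb         # conquer: CLIP-style image-phrase similarity
        chosen.append(phrases[int(scores.argmax())])      # keep the best-matching phrase per group
    return ", ".join(chosen)                              # combine: merge phrases into one description


def alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over a batch of matched
    pairs -- a standard way to explicitly align the two feature spaces."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under these assumptions, the generated captions would populate a dataset like LUPerson-T, and the alignment loss would be minimized over image-caption batches so that the pre-trained encoders already share a feature space before fine-tuning on CUHK-PEDES, ICFG-PEDES, or RSTPReid.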
Related papers
- T2I-ConBench: Text-to-Image Benchmark for Continual Post-training [25.90279125119419]
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models.
We introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models.
It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment.
arXiv Detail & Related papers (2025-05-22T16:31:43Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.