Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
- URL: http://arxiv.org/abs/2303.04497v1
- Date: Wed, 8 Mar 2023 10:41:22 GMT
- Title: Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
- Authors: Guanshuo Wang, Fufu Yu, Junjie Li, Qiong Jia, Shouhong Ding
- Abstract summary: Text-based Person Search (TPS) aims to retrieve pedestrians that match text descriptions instead of query images.
Recent Vision-Language Pre-training models can bring transferable knowledge to downstream TPS tasks, yielding more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
- Score: 17.360982091304137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based Person Search (TPS) aims to retrieve pedestrians that match
text descriptions instead of query images. Recent Vision-Language Pre-training
(VLP) models can bring transferable knowledge to downstream TPS tasks, yielding
more efficient performance gains. However, existing TPS methods improved by VLP
only utilize pre-trained visual encoders, neglecting the corresponding textual
representation and breaking the significant modality alignment learned from
large-scale pre-training. In this paper, we explore the full utilization of the
textual potential of VLP in TPS tasks. We first build a VLP-TPS baseline model,
the first TPS model with both pre-trained modalities. We then propose the
Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the
textual modality by incorporating different components of the fine-grained
corpus during training. Inspired by the prompt approach for zero-shot
classification with VLP models, we propose the Dynamic Attribute Prompt (DAP)
to provide a unified corpus of fine-grained attributes as language hints for
the image modality. Extensive experiments show that our proposed TPS framework
achieves state-of-the-art performance, exceeding the previous best method by a
clear margin.
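As a rough illustration of the dual-encoder retrieval setting described in the abstract, the sketch below ranks a gallery of person-image embeddings against a text query by cosine similarity, and builds the query from a simple attribute-style prompt. The encoder classes, prompt template, and feature dimensions are hypothetical placeholders for illustration; this is not the VLP-TPS, MIDC, or DAP implementation.

```python
# Minimal sketch of CLIP-style dual-encoder text-to-image person retrieval.
# Everything here (ToyEncoder, attribute_prompt, dimensions) is a placeholder.
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Stand-in for a pre-trained visual or textual encoder."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project and L2-normalize so cosine similarity is a dot product.
        return F.normalize(self.proj(x), dim=-1)

def attribute_prompt(attributes: list[str]) -> str:
    # Hypothetical template in the spirit of attribute prompts: turn a set of
    # fine-grained attributes into a single language hint.
    return "a pedestrian with " + ", ".join(attributes)

# Toy gallery of 5 person images (pre-extracted features) and one text query.
image_feats = torch.randn(5, 512)   # placeholder visual features
text_feats = torch.randn(1, 300)    # placeholder textual features

visual_encoder = ToyEncoder(512)
textual_encoder = ToyEncoder(300)

img_emb = visual_encoder(image_feats)   # (5, 256)
txt_emb = textual_encoder(text_feats)   # (1, 256)

# Text-to-image retrieval: rank gallery images by cosine similarity.
scores = txt_emb @ img_emb.t()          # (1, 5)
ranking = scores.argsort(dim=-1, descending=True)
print("query prompt:", attribute_prompt(["red jacket", "black backpack"]))
print("gallery ranking:", ranking.tolist())
```

In practice, the toy encoders would be replaced by the pre-trained visual and textual encoders of a VLP model, so that the modality alignment learned during large-scale pre-training is preserved rather than broken.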
Related papers
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning, which can alleviate overfitting in few-shot learning applications.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z)
- Weakly Supervised Vision-and-Language Pre-training with Relative Representations [76.63610760577214]
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
arXiv Detail & Related papers (2023-05-24T18:10:24Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made notable progress in the field of text detection.
This paper proposes a new method, termed TCM, which focuses on turning the CLIP model directly into a text detector without a dedicated pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task as a fill-in-the-blank problem, encouraging the model to predict the objects in given blocks or to regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference time while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones; a minimal sketch of this pixel-text scoring appears after this list.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training [27.103514548337404]
Existing approaches to vision-language pre-training rely on an object detector based on bounding boxes (regions).
In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps.
arXiv Detail & Related papers (2021-08-21T09:57:21Z)
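As referenced in the DenseCLIP entry above, the following is a minimal sketch of the pixel-text matching idea under simple assumptions: pre-extracted dense visual features are scored against one text embedding per class to form per-class score maps. The function name, shapes, and use of random features are illustrative, not DenseCLIP's actual code.

```python
# Sketch: score every pixel embedding against every class text embedding.
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats: torch.Tensor,
                          text_feats: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (B, C, H, W) dense visual features.
    text_feats:  (K, C) one embedding per class name/prompt.
    Returns cosine-similarity score maps of shape (B, K, H, W)."""
    pixel_feats = F.normalize(pixel_feats, dim=1)   # normalize channel dim
    text_feats = F.normalize(text_feats, dim=-1)
    return torch.einsum("bchw,kc->bkhw", pixel_feats, text_feats)

# Toy example: batch of 2 feature maps, 3 text classes, 256-d embeddings.
scores = pixel_text_score_maps(torch.randn(2, 256, 32, 32),
                               torch.randn(3, 256))
print(scores.shape)  # torch.Size([2, 3, 32, 32])
```

The resulting (B, K, H, W) score maps could then serve as guidance for a dense prediction head, which is the role described in the entry's summary.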