Text-based Person Search without Parallel Image-Text Data
- URL: http://arxiv.org/abs/2305.12964v2
- Date: Fri, 4 Aug 2023 13:04:24 GMT
- Title: Text-based Person Search without Parallel Image-Text Data
- Authors: Yang Bai, Jingyao Wang, Min Cao, Chen Chen, Ziqiang Cao, Liqiang Nie and Min Zhang
- Abstract summary: Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by models trained on parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
- Score: 52.63433741872629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based person search (TBPS) aims to retrieve images of a target person from a large image gallery given a natural language description. Existing methods are dominated by models trained on parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data ($\mu$-TBPS), in which only non-parallel images and texts, or even image-only data, are available. Towards this end, we propose a two-stage framework, generation-then-retrieval (GTR): first generate a pseudo text for each image, then perform retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of the person image: a set of instruction prompts activates an off-the-shelf pretrained vision-language model to capture and generate fine-grained person attributes, which are then converted into a textual description via a finetuned large language model or a hand-crafted template. In the retrieval stage, considering the noise that the generated texts introduce into training, we develop a confidence score-based training scheme in which more reliable texts contribute more to the training. Experimental results on multiple TBPS benchmarks (i.e., CUHK-PEDES, ICFG-PEDES and RSTPReid) show that the proposed GTR achieves promising performance without relying on parallel image-text data.
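The two-stage pipeline is concrete enough to sketch. The Python sketch below is illustrative only: the attribute prompt set, the `vlm.answer`/`llm.rewrite` interfaces, and the use of a symmetric InfoNCE loss with per-pair weighting are assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Instruction prompts for fine-grained person attributes. The concrete
# wording and attribute set here are assumptions, not the paper's prompts.
ATTRIBUTE_PROMPTS = {
    "gender": "What is the gender of the person?",
    "upper": "What is the person wearing on the upper body?",
    "lower": "What is the person wearing on the lower body?",
    "shoes": "What kind of shoes is the person wearing?",
    "belongings": "What is the person carrying?",
}

def generate_pseudo_text(image, vlm, llm=None):
    """Stage 1 (generation): prompt a pretrained VLM for attributes, then
    compose a caption. `vlm.answer` and `llm.rewrite` are hypothetical
    interfaces standing in for the off-the-shelf models."""
    attrs = {name: vlm.answer(image, prompt)
             for name, prompt in ATTRIBUTE_PROMPTS.items()}
    if llm is not None:  # option A: a finetuned LLM rewrites the attributes
        return llm.rewrite(attrs)
    # option B: a hand-crafted template
    return (f"A {attrs['gender']} wearing {attrs['upper']} and {attrs['lower']}, "
            f"with {attrs['shoes']}, carrying {attrs['belongings']}.")

def confidence_weighted_loss(img_emb, txt_emb, confidence, tau=0.07):
    """Stage 2 (retrieval): a symmetric contrastive loss where each pair is
    weighted by a confidence score in [0, 1], so more reliable pseudo texts
    contribute more. The scoring rule itself is left abstract here."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                    # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    i2t = F.cross_entropy(logits, targets, reduction="none")
    t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    return (confidence * 0.5 * (i2t + t2i)).mean()          # down-weight noisy pairs
```

In practice the confidence score would come from some auxiliary image-text matching signal; in this sketch it is simply an input tensor of shape (B,).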
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204] (2024-08-11)
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only Text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
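As summarized above, DPTR pre-trains a recognition decoder on text alone by treating CLIP text embeddings as pseudo visual features. A minimal sketch of that idea follows; the decoder configuration, character vocabulary, and pooled-embedding memory are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

VOCAB = 100  # assumed character vocabulary size
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4).to(device)
char_embed = nn.Embedding(VOCAB, 512).to(device)
char_head = nn.Linear(512, VOCAB).to(device)

def pretrain_step(text_batch, char_targets):
    """Pre-train the decoder with text only: the pooled CLIP text embedding
    stands in for the visual memory the decoder would normally attend to."""
    with torch.no_grad():
        tokens = clip.tokenize(text_batch).to(device)
        pseudo_visual = clip_model.encode_text(tokens).float()  # (B, 512)
    memory = pseudo_visual.unsqueeze(1)    # (B, 1, 512) pseudo visual sequence
    tgt = char_embed(char_targets)         # (B, T, 512) shifted character inputs
    out = decoder(tgt, memory)             # cross-attention onto pseudo visuals
    return char_head(out)                  # (B, T, VOCAB) character logits
```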
- Semi-supervised Text-based Person Search [47.14739994781334] (2024-04-28)
Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning.
We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS.
We propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data.
- TextCLIP: Text-Guided Face Image Generation and Manipulation without Adversarial Training [5.239585892767183] (2023-09-21)
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522] (2023-07-21)
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335] (2023-04-10)
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
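The summary describes learning word-region alignment directly from image-text pairs. Below is a toy sketch of one common way to score such alignment; the shapes, the max-over-regions pooling, and the mean aggregation are assumptions rather than DetCLIPv2's actual recipe.

```python
import torch
import torch.nn.functional as F

def word_region_alignment(word_emb, region_emb):
    """word_emb: (B, W, D) word features from the text encoder;
    region_emb: (B, R, D) region-proposal features from the detector.
    Returns a per-pair alignment score usable in a contrastive loss."""
    w = F.normalize(word_emb, dim=-1)
    r = F.normalize(region_emb, dim=-1)
    pairwise = torch.einsum("bwd,brd->bwr", w, r)  # word-region similarity grid
    best = pairwise.max(dim=-1).values             # each word picks its best region
    return best.mean(dim=-1)                       # (B,) image-text alignment score
```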
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366] (2023-03-28)
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
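Because CLIP places images and texts in a joint embedding space, a generator conditioned on CLIP image embeddings at training time can be driven by CLIP text embeddings at inference, which is what makes caption-free training possible. A hedged sketch, assuming a hypothetical `generator` module and a fixed-variance Gaussian likelihood (so the conditional log-likelihood reduces to an L2 reconstruction term):

```python
import torch
import clip  # OpenAI CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def train_loss(generator, images):
    """Caption-free training: maximize log p(image | CLIP image embedding)."""
    with torch.no_grad():
        cond = clip_model.encode_image(images).float()  # stands in for a text embedding
    recon = generator(cond)                             # decode image from the embedding
    # fixed-variance Gaussian likelihood -> L2 reconstruction loss
    return ((recon - images) ** 2).mean()

def generate(generator, caption):
    """Inference: swap in the CLIP text embedding for an unseen caption."""
    with torch.no_grad():
        cond = clip_model.encode_text(clip.tokenize([caption]).to(device)).float()
    return generator(cond)
```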
- Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search [17.360982091304137] (2023-03-08)
Text-based Person Search (TPS) targets retrieving pedestrians that match text descriptions rather than query images.
Recent vision-language pre-training models can bring transferable knowledge to downstream TPS tasks, yielding more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292] (2022-06-15)
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
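A minimal sketch of pushing fusion into the backbone, as the summary describes: each uni-modal block gains a gated cross-attention over the other modality instead of relying on dedicated fusion layers on top. The gating and layer sizes are assumptions, not FIBER's exact design.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    """A uni-modal transformer block with cross-attention fused into it."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as a plain uni-modal block

    def forward(self, x, other):
        x = x + self.self_attn(x, x, x)[0]        # usual uni-modal attention
        # fusion inside the backbone: attend to the other modality's tokens
        x = x + self.gate * self.cross_attn(x, other, other)[0]
        return x
```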
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299] (2022-03-01)
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
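A short sketch of the retrieval-based construction step described above: pair each unaligned image with its nearest texts in a shared embedding space. The choice of encoder and the value of k are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def build_weakly_aligned_corpus(image_embs, text_embs, k=3):
    """image_embs: (N_img, D), text_embs: (N_txt, D) from any shared encoder.
    Returns, for each image, the indices of its k nearest texts, which serve
    as weakly aligned pairs for the alignment pre-training tasks."""
    sims = F.normalize(image_embs, dim=-1) @ F.normalize(text_embs, dim=-1).t()
    return sims.topk(k, dim=-1).indices  # (N_img, k) retrieved text indices
```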
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.