Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search
- URL: http://arxiv.org/abs/2311.09084v1
- Date: Wed, 15 Nov 2023 16:26:49 GMT
- Title: Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search
- Authors: Hefeng Wu, Weifeng Chen, Zhibin Liu, Tianshui Chen, Zhiguang Chen,
Liang Lin
- Abstract summary: Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
- Score: 60.626459715780605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a descriptive text query, text-based person search (TBPS) aims to
retrieve the best-matched target person from an image gallery. Such a
cross-modal retrieval task is quite challenging due to the significant modality
gap, fine-grained differences, and the insufficiency of annotated data. To better
align the two modalities, most existing works focus on introducing
sophisticated network structures and auxiliary tasks, which are complex and
hard to implement. In this paper, we propose a simple yet effective dual
Transformer model for text-based person search. By exploiting a hardness-aware
contrastive learning strategy (a minimal sketch follows the abstract), our model
achieves state-of-the-art performance without any special design for local
feature alignment or side information.
Moreover, we propose a proximity data generation (PDG) module to automatically
produce more diverse data for cross-modal training. The PDG module first
introduces an automatic generation algorithm based on a text-to-image diffusion
model, which generates new text-image pair samples in the proximity space of
original ones. Then it combines approximate text generation and feature-level
mixup during training to further strengthen data diversity (the mixup step is
also sketched after the abstract). The PDG module
largely guarantees the plausibility of the generated samples, which are used
directly for training without any human inspection for noise rejection. It
improves the performance of our model significantly, providing a feasible
solution to the data insufficiency problem faced by such fine-grained
visual-linguistic tasks. Extensive experiments on two popular datasets of the
TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach
clearly outperforms state-of-the-art approaches, e.g., improving Top-1, Top-5,
and Top-10 accuracy on CUHK-PEDES by 3.88%, 4.02%, and 2.92%, respectively. The
code will be made available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG
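
The abstract names a hardness-aware contrastive learning strategy but does not spell it out here. The following is a minimal, hypothetical sketch of one common realization: a symmetric image-text InfoNCE loss whose negative terms are up-weighted in proportion to their similarity, so harder negatives contribute more. The function name, weighting scheme, and hyperparameters are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def hardness_aware_contrastive_loss(img_emb, txt_emb, tau=0.07, beta=2.0):
    """Symmetric InfoNCE over a batch of matched image/text embeddings,
    with negatives re-weighted by their similarity ("hardness").

    img_emb, txt_emb: (B, D) tensors; row i of each is a matched pair.
    tau: softmax temperature; beta: hardness weighting strength.
    NOTE: illustrative sketch only, not the paper's exact loss.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / tau                  # (B, B) similarity logits
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    def one_direction(logits):
        # Up-weight hard negatives in log-space; positives keep weight 1.
        with torch.no_grad():
            log_w = (beta * logits).masked_fill(mask, 0.0)
        pos = logits.diagonal()                        # matched-pair logits
        denom = torch.logsumexp(logits + log_w, dim=1) # weighted denominator
        return (denom - pos).mean()

    return 0.5 * (one_direction(sim) + one_direction(sim.t()))

# Toy usage with random features:
if __name__ == "__main__":
    img, txt = torch.randn(8, 256), torch.randn(8, 256)
    print(hardness_aware_contrastive_loss(img, txt).item())
```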
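Likewise, feature-level mixup is only named in the abstract. A plausible reading is standard mixup applied to paired embeddings rather than raw inputs, mixing both modalities with a shared coefficient so the mixed image-text pairs stay aligned. This sketch is an assumption, not the paper's actual PDG implementation.

```python
import torch

def paired_feature_mixup(img_emb, txt_emb, alpha=0.4):
    """Mix (image, text) embedding pairs with a shared Beta-sampled
    coefficient so mixed samples remain cross-modally aligned.

    Illustrative sketch only; the actual PDG mixing rule may differ.
    """
    B = img_emb.size(0)
    lam = torch.distributions.Beta(alpha, alpha).sample((B, 1)).to(img_emb.device)
    perm = torch.randperm(B, device=img_emb.device)    # random partner per pair
    mixed_img = lam * img_emb + (1 - lam) * img_emb[perm]
    mixed_txt = lam * txt_emb + (1 - lam) * txt_emb[perm]
    return mixed_img, mixed_txt

# The mixed pairs could simply be appended to the batch before computing
# the contrastive loss sketched above.
```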
Related papers
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt
Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to overfit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that can alleviate overfitting in few-shot learning applications (a hedged sketch follows this summary).
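As a rough, hypothetical illustration of a data-dependent prior for prompt learning (the module name, shapes, and Gaussian parameterization below are all invented for the sketch), the prompt's context vectors can be sampled from a Gaussian whose parameters are predicted from image features via the reparameterization trick:

```python
import torch
import torch.nn as nn

class DataDependentPrompt(nn.Module):
    """Hypothetical sketch: sample prompt context vectors from a Gaussian
    whose parameters are predicted from image features (data-dependent prior).
    """
    def __init__(self, feat_dim=512, n_ctx=4, ctx_dim=512):
        super().__init__()
        self.n_ctx, self.ctx_dim = n_ctx, ctx_dim
        self.mu = nn.Linear(feat_dim, n_ctx * ctx_dim)
        self.log_var = nn.Linear(feat_dim, n_ctx * ctx_dim)

    def forward(self, img_feat):
        mu = self.mu(img_feat)
        std = torch.exp(0.5 * self.log_var(img_feat))
        eps = torch.randn_like(std)
        ctx = mu + eps * std                     # reparameterization trick
        return ctx.view(-1, self.n_ctx, self.ctx_dim)

# The sampled ctx would then be prepended to class-name token embeddings
# before the text encoder, as in standard prompt learning.
```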
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution that ensembles predictions from multiple constituent LLMs (sketched below).
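The summary does not say how predictions are combined; the simplest version is soft voting, averaging each constituent detector's probability that a text is machine-generated. A minimal sketch (the detector callables are stand-ins, not a real API):

```python
from typing import Callable, List

# Each detector maps a text to P(machine-generated); in practice these
# would wrap real models (placeholders here, not an actual library API).
Detector = Callable[[str], float]

def ensemble_score(text: str, detectors: List[Detector]) -> float:
    """Average the constituent detectors' probabilities (soft voting)."""
    return sum(d(text) for d in detectors) / len(detectors)

if __name__ == "__main__":
    fake_detectors = [lambda t: 0.9, lambda t: 0.7, lambda t: 0.8]
    print(ensemble_score("some text", fake_detectors))  # 0.8
```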
arXiv Detail & Related papers (2023-11-06T13:11:02Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as annotations.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It uses a retrieval-then-optimization procedure to synthesize pseudo text features (a hedged sketch follows this entry).
It benefits a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
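Retrieval-then-optimization is only named in the summary. One plausible sketch: retrieve the text features most similar to an image feature from a feature bank, then refine their mean by gradient descent on cosine similarity with the image feature. All shapes, step counts, and the learning rate below are assumptions:

```python
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_feat, text_bank, k=5, steps=50, lr=0.1):
    """Synthesize a pseudo text feature for one image feature.

    1) Retrieval: pick the k most similar text features in the bank.
    2) Optimization: refine their mean to better match the image feature.
    Illustrative sketch; Lafite2's actual procedure may differ.
    """
    img_feat = F.normalize(img_feat, dim=-1)           # (D,)
    bank = F.normalize(text_bank, dim=-1)              # (N, D)
    topk = (bank @ img_feat).topk(k).indices           # nearest neighbors
    pseudo = bank[topk].mean(dim=0).clone().requires_grad_(True)
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(steps):
        loss = -F.cosine_similarity(pseudo, img_feat, dim=0)
        opt.zero_grad(); loss.backward(); opt.step()
    return F.normalize(pseudo.detach(), dim=-1)

# Toy usage:
if __name__ == "__main__":
    feat = pseudo_text_feature(torch.randn(512), torch.randn(1000, 512))
    print(feat.shape)  # torch.Size([512])
```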
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt a self-training framework to address the few-shot MR-to-Text (meaning representation to text) generation problem.
We propose a novel data selection strategy that selects the data our generation model is most uncertain about (sketched below).
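The selection criterion is not detailed in the summary; a standard instantiation scores each candidate by the model's average per-token negative log-likelihood and keeps the most uncertain ones. The sketch below assumes a Hugging Face-style causal LM interface and an invented top-k policy:

```python
import torch

def select_most_uncertain(model, tokenizer, candidates, top_k=100, device="cpu"):
    """Rank candidate texts by the model's average per-token NLL and
    return the top_k most uncertain ones.

    Illustrative sketch: assumes a Hugging Face-style causal LM where
    model(input_ids, labels=input_ids).loss is the mean token NLL.
    """
    scores = []
    model.eval()
    with torch.no_grad():
        for text in candidates:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            nll = model(ids, labels=ids).loss.item()   # mean per-token NLL
            scores.append((nll, text))
    scores.sort(reverse=True)                          # highest NLL first
    return [t for _, t in scores[:top_k]]
```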
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems caused by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require large numbers of labeled samples and learn separate models for different document types.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Generative Adversarial Networks for Annotated Data Augmentation in Data Sparse NLU [0.76146285961466]
Data sparsity is one of the key challenges associated with model development in Natural Language Understanding.
We present our results on boosting NLU model performance through training data augmentation using a sequential generative adversarial network (GAN).
Our experiments reveal that synthetic data generated by the sequential GAN provides significant performance boosts across multiple metrics.
arXiv Detail & Related papers (2020-12-09T20:38:17Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.