Text-Based Person Search with Limited Data
- URL: http://arxiv.org/abs/2110.10807v1
- Date: Wed, 20 Oct 2021 22:20:47 GMT
- Title: Text-Based Person Search with Limited Data
- Authors: Xiao Han, Sen He, Li Zhang, Tao Xiang
- Abstract summary: Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
- Score: 66.26504077270356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based person search (TBPS) aims at retrieving a target person from an
image gallery with a descriptive text query. Solving such a fine-grained
cross-modal retrieval task is challenging, and the difficulty is further compounded by the
lack of large-scale datasets. In this paper, we present a framework with two
novel components to handle the problems brought by limited data. Firstly, to
fully utilize the existing small-scale benchmarking datasets for more
discriminative feature learning, we introduce a cross-modal momentum
contrastive learning framework to enrich the training data for a given
mini-batch. Secondly, we propose to transfer knowledge learned from existing
coarse-grained large-scale datasets containing image-text pairs from
drastically different problem domains to compensate for the lack of TBPS
training data. A transfer learning method is designed so that useful
information can be transferred despite the large domain gap. Armed with these
components, our method achieves new state of the art on the CUHK-PEDES dataset
with significant improvements over the prior art in terms of Rank-1 and mAP.
Our code is available at https://github.com/BrandonHanx/TextReID.
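As a rough illustration of the first component, the sketch below implements cross-modal momentum contrastive learning in the spirit of MoCo: each modality keeps a momentum-updated key encoder and a queue of past key features, so every mini-batch is contrasted against a much larger pool of negatives. The encoder interfaces, shared embedding dimension, queue size, momentum, and temperature are all illustrative assumptions rather than the paper's exact configuration; img_enc and txt_enc are assumed to already include projection heads into the shared space.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMoCo(nn.Module):
    """Cross-modal momentum contrastive learning, MoCo-style (illustrative)."""

    def __init__(self, img_enc, txt_enc, dim=256, queue_size=4096,
                 momentum=0.999, temperature=0.07):
        super().__init__()
        self.img_enc, self.txt_enc = img_enc, txt_enc
        # Momentum ("key") copies; never updated by back-propagation.
        self.img_enc_m = copy.deepcopy(img_enc)
        self.txt_enc_m = copy.deepcopy(txt_enc)
        for p in list(self.img_enc_m.parameters()) + list(self.txt_enc_m.parameters()):
            p.requires_grad = False
        self.m, self.t = momentum, temperature
        # One feature queue per modality enlarges the pool of negatives
        # beyond the current mini-batch.
        self.register_buffer("img_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for enc, enc_m in ((self.img_enc, self.img_enc_m), (self.txt_enc, self.txt_enc_m)):
            for p, p_m in zip(enc.parameters(), enc_m.parameters()):
                p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, img_k, txt_k):
        n, ptr = img_k.shape[0], int(self.ptr)  # assumes queue_size % n == 0
        self.img_queue[ptr:ptr + n] = img_k
        self.txt_queue[ptr:ptr + n] = txt_k
        self.ptr[0] = (ptr + n) % self.img_queue.shape[0]

    def _info_nce(self, q, k, queue):
        pos = (q * k).sum(dim=1, keepdim=True)   # (B, 1): matched pair
        neg = q @ queue.clone().T                # (B, K): queued negatives
        logits = torch.cat([pos, neg], dim=1) / self.t
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)   # positive sits at index 0

    def forward(self, images, texts):
        q_img = F.normalize(self.img_enc(images), dim=1)
        q_txt = F.normalize(self.txt_enc(texts), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k_img = F.normalize(self.img_enc_m(images), dim=1)
            k_txt = F.normalize(self.txt_enc_m(texts), dim=1)
        loss = (self._info_nce(q_img, k_txt, self.txt_queue)
                + self._info_nce(q_txt, k_img, self.img_queue))
        self._enqueue(k_img, k_txt)
        return loss
```

In use, the returned loss would be minimized jointly with the usual image-text matching objectives.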
Related papers
- Text-Enhanced Data-free Approach for Federated Class-Incremental Learning [36.70524853012054]
Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing forgetting and data privacy problems.
Prior approaches lack the necessary synergy between DFKT and the model training phases.
We introduce LANDER to address this issue by utilizing label text embeddings produced by pretrained language models.
arXiv Detail & Related papers (2024-03-21T03:24:01Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model (a toy version of such a block is sketched after this entry).
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
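To make the fusion-in-the-backbone idea concrete, here is a minimal, hedged sketch of a single image-backbone block with a cross-attention module inserted inside it; the zero-initialized gate, layer sizes, and module layout are simplifying assumptions, not FIBER's exact design.

```python
import torch
import torch.nn as nn


class FusedBackboneBlock(nn.Module):
    """One image-backbone block with cross-modal fusion inserted inside it."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Extra cross-attention module: the "fusion in the backbone".
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Zero-initialized gate: at the start of training the block behaves
        # exactly like the pretrained uni-modal block.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, img_tokens, txt_tokens):
        x = self.n1(img_tokens)
        img_tokens = img_tokens + self.self_attn(x, x, x)[0]     # usual self-attention
        y = self.n2(img_tokens)
        fused = self.cross_attn(y, txt_tokens, txt_tokens)[0]    # attend to text mid-backbone
        img_tokens = img_tokens + self.gate * fused              # gated fusion
        return img_tokens + self.mlp(self.n3(img_tokens))
```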
- Bi-level Alignment for Cross-Domain Crowd Counting [113.78303285148041]
Current methods rely on external data for training an auxiliary task or apply an expensive coarse-to-fine estimation.
We develop a new adversarial learning based method, which is simple and efficient to apply.
We evaluate our approach on five real-world crowd counting benchmarks, where we outperform existing approaches by a large margin.
arXiv Detail & Related papers (2022-05-12T02:23:25Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for KIE require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Multimodal Prototypical Networks for Few-shot Learning [20.100480009813953]
A cross-modal feature generation framework is used to enrich the sparsely populated embedding space in few-shot scenarios.
We show that in such cases nearest-neighbor classification is a viable approach that outperforms state-of-the-art single-modal and multimodal few-shot learning methods (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-11-17T19:32:59Z)
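The nearest-prototype decision rule mentioned above can be summarized in a few lines. In this sketch the function name is illustrative, and the cross-modal generation step that enriches the support set is omitted:

```python
import torch


def prototype_classify(support_feats, support_labels, query_feats, n_classes):
    """Nearest-prototype classification over an (enriched) support set.

    support_feats may mix real image embeddings with cross-modally
    generated ones; the generation step itself is omitted here.
    """
    # Class prototype = mean of all support features with that label.
    protos = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                          # (n_classes, dim)
    dists = torch.cdist(query_feats, protos)    # (n_query, n_classes)
    return dists.argmin(dim=1)                  # label of nearest prototype
```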
- Data-Efficient Ranking Distillation for Image Retrieval [15.88955427198763]
Recent approaches tackle the cost of large retrieval models using knowledge distillation to transfer knowledge from a deeper and heavier architecture to a much smaller network.
In this paper we address knowledge distillation for metric learning problems.
Unlike previous approaches, our proposed method jointly addresses the following constraints: i) limited queries to the teacher model, ii) a black-box teacher model with access only to the final output representation, and iii) a small fraction of the original training data without any ground-truth labels (a toy relational variant is sketched after this entry).
arXiv Detail & Related papers (2020-07-10T10:59:16Z)
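Under constraints i)-iii), one plausible stand-in objective (not necessarily the paper's exact ranking loss) is relational distillation: query the black-box teacher once for output embeddings on the small unlabeled set, then train the student to reproduce the teacher's pairwise similarity structure.

```python
import torch
import torch.nn.functional as F


def relational_distill_loss(student_emb, teacher_emb):
    """Match the student's pairwise similarity structure to the teacher's.

    teacher_emb is obtained by querying the black-box teacher once on a
    small unlabeled set; no labels or teacher internals are required.
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    sim_s = s @ s.T                      # student pairwise similarities
    sim_t = (t @ t.T).detach()           # teacher similarities, fixed target
    return F.mse_loss(sim_s, sim_t)
```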
This list is automatically generated from the titles and abstracts of the papers on this site.