VTBR: Semantic-based Pretraining for Person Re-Identification
- URL: http://arxiv.org/abs/2110.05074v1
- Date: Mon, 11 Oct 2021 08:19:45 GMT
- Title: VTBR: Semantic-based Pretraining for Person Re-Identification
- Authors: Suncheng Xiang, Zirui Zhang, Mengyuan Guan, Hao Chen, Binjie Yan, Ting Liu, Yuzhuo Fu
- Abstract summary: We propose a pure semantic-based pretraining approach named VTBR.
We train convolutional networks from scratch on the captions of the FineGPR-C dataset and transfer them to downstream Re-ID tasks.
- Score: 14.0819152482295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining is a dominant paradigm in computer vision. Generally, supervised
ImageNet pretraining is commonly used to initialize the backbones of person
re-identification (Re-ID) models. However, recent works show the surprising
result that ImageNet pretraining has limited impact on Re-ID systems due to the
large domain gap between ImageNet and person Re-ID data. To seek an alternative
to traditional pretraining, we manually construct, for the first time, a
diversified caption dataset for person Re-ID, named FineGPR-C. Based on it, we
propose a pure semantic-based pretraining approach named VTBR, which uses dense
captions to learn visual representations with fewer images. Specifically, we
train convolutional networks from scratch on the captions of the FineGPR-C
dataset and transfer them to downstream Re-ID tasks. Comprehensive experiments
conducted on benchmarks show that our VTBR can achieve competitive performance
compared with ImageNet pretraining -- despite using up to 1.4x fewer images,
revealing its potential in Re-ID pretraining.
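The abstract gives the pipeline but no code; a minimal sketch, assuming a VirTex-style design (a ResNet-50 trained from scratch whose spatial features condition a small caption decoder), is shown below. The module sizes, vocabulary size, and decoder depth are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class CaptionPretrainer(nn.Module):
    """Rough sketch of caption-based pretraining in the spirit of VTBR.

    A visual backbone is trained from scratch; its spatial features condition
    a small Transformer decoder that predicts caption tokens. After
    pretraining, only the backbone is kept for downstream Re-ID fine-tuning.
    All sizes are assumptions for illustration.
    """

    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)  # from scratch
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.proj(self.backbone(images))      # (B, D, H, W)
        memory = feats.flatten(2).transpose(1, 2)     # (B, H*W, D) regions
        tgt = self.embed(tokens)                      # (B, T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                      # next-token logits

model = CaptionPretrainer()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 16))             # mock caption tokens
logits = model(images, tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
```

After pretraining, only model.backbone would be retained and fine-tuned on the downstream Re-ID benchmark, mirroring how ImageNet-pretrained weights are normally used.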
Related papers
- DreamTeacher: Pretraining Image Backbones with Deep Generative Models [103.62397699392346]
We introduce a self-supervised feature representation learning framework that utilizes generative networks for pre-training downstream image backbones.
We investigate distilling learned generative features onto target image backbones as an alternative to pretraining these backbones on large labeled datasets such as ImageNet.
We empirically find that our DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board.
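At a sketch level (not DreamTeacher's released code), distilling generative features onto a target backbone might look as follows; the mock modules and the MSE objective are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# "teacher" stands in for a frozen generative network exposing intermediate
# features (e.g. a pretrained GAN or diffusion model); it is only mocked here.
teacher = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1)).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

student = nn.Conv2d(3, 64, 3, padding=1)   # target image backbone (mock)
regressor = nn.Conv2d(64, 256, 1)          # aligns student/teacher channels

images = torch.randn(4, 3, 32, 32)
with torch.no_grad():
    t_feat = teacher(images)
s_feat = regressor(student(images))
loss = F.mse_loss(s_feat, t_feat)   # regress teacher features (assumed loss)
loss.backward()
```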
arXiv Detail & Related papers (2023-07-14T17:17:17Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
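A toy reading of pushing fusion deep into the model: each uni-modal block is followed by cross-attention into the other modality, rather than reserving fusion for dedicated layers stacked on top. Dimensions and placement below are assumptions, not FIBER's actual architecture.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    """Minimal sketch of in-backbone multimodal fusion via cross-attention."""

    def __init__(self, d=256):
        super().__init__()
        self.img_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.txt_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.txt2img = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, img, txt):
        img, txt = self.img_layer(img), self.txt_layer(txt)
        img = img + self.img2txt(img, txt, txt)[0]   # image attends to text
        txt = txt + self.txt2img(txt, img, img)[0]   # text attends to image
        return img, txt

block = FusedBlock()
img_tokens = torch.randn(2, 49, 256)   # mock image patch tokens
txt_tokens = torch.randn(2, 16, 256)   # mock text tokens
img_out, txt_out = block(img_tokens, txt_tokens)
```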
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Semantic-aware Dense Representation Learning for Remote Sensing Image Change Detection [20.761672725633936]
Training a deep learning-based change detection model heavily depends on labeled data.
A recent trend is using remote sensing (RS) data to obtain in-domain representations via supervised or self-supervised learning (SSL).
We propose dense semantic-aware pre-training for RS image CD via sampling multiple class-balanced points.
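One plausible reading of sampling multiple class-balanced points, sketched below with assumed shapes, is drawing an equal number of pixel locations per semantic class and gathering dense features at those points for the pre-training objective.

```python
import torch

def sample_class_balanced_points(label_map, points_per_class=8):
    """Sample an equal number of pixel coordinates per class from a semantic
    mask (an assumed reading of "class-balanced point sampling")."""
    coords = {}
    for cls in label_map.unique():
        ys, xs = torch.nonzero(label_map == cls, as_tuple=True)
        idx = torch.randperm(ys.numel())[:points_per_class]
        coords[int(cls)] = torch.stack([ys[idx], xs[idx]], dim=1)
    return coords

label_map = torch.randint(0, 4, (64, 64))   # mock semantic mask
points = sample_class_balanced_points(label_map)
dense_feats = torch.randn(256, 64, 64)      # mock dense features (C, H, W)
# Gather per-class features at the sampled locations for a dense loss.
some_cls = next(iter(points.values()))
cls_feats = dense_feats[:, some_cls[:, 0], some_cls[:, 1]]  # (256, P)
```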
arXiv Detail & Related papers (2022-05-27T06:08:33Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
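A minimal sketch of the corrupt-then-enhance loop, with the BEiT-style generator mocked and the objective simplified to masked pixel regression (an assumed simplification, not CIM's exact losses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Conv2d(3, 3, 3, padding=1)   # mock corruptor network
enhancer = nn.Conv2d(3, 3, 3, padding=1)    # backbone being pretrained (mock)

images = torch.randn(4, 3, 32, 32)
# Mask random patch regions and let the generator fill them in, so the input
# is corrupted with plausible content rather than artificial mask tokens.
mask = (torch.rand(4, 1, 4, 4) < 0.4).float()
mask = F.interpolate(mask, size=32)         # patch mask -> pixel mask
corrupted = images * (1 - mask) + generator(images) * mask

recon = enhancer(corrupted)                 # recover the clean image
loss = F.mse_loss(recon * mask, images * mask)
loss.backward()
```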
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Semantic decoupled representation learning for remote sensing image change detection [17.548248093344576]
We propose semantic-decoupled representation learning for RS image CD.
We disentangle representations of different semantic regions by leveraging the semantic mask.
We additionally force the model to distinguish different semantic representations, which benefits the recognition of objects of interest in the downstream CD task.
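One way to read disentangling representations via the semantic mask, sketched with assumed shapes and an assumed separation loss, is masked average pooling per semantic region followed by pushing the per-region embeddings apart:

```python
import torch
import torch.nn.functional as F

def region_representations(feats, sem_mask, num_classes=4):
    """Pool dense features separately inside each semantic region."""
    reps = []
    for cls in range(num_classes):
        m = (sem_mask == cls).float().unsqueeze(1)            # (B, 1, H, W)
        reps.append((feats * m).sum((2, 3)) / m.sum((2, 3)).clamp(min=1))
    return torch.stack(reps, dim=1)                           # (B, K, C)

feats = torch.randn(2, 128, 16, 16)                # mock dense features
sem_mask = torch.randint(0, 4, (2, 16, 16))        # mock semantic mask
reps = F.normalize(region_representations(feats, sem_mask), dim=-1)
# Push apart representations of different semantic regions (assumed loss).
sim = reps @ reps.transpose(1, 2)                  # (B, K, K) similarities
off_diag = sim - torch.diag_embed(sim.diagonal(dim1=1, dim2=2))
loss = off_diag.abs().mean()
```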
arXiv Detail & Related papers (2022-01-15T07:35:26Z)
- Unleashing the Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification [10.045028405219641]
We design an Unsupervised Pre-training framework for ReID based on the contrastive learning (CL) pipeline, dubbed UP-ReID.
We introduce an intra-identity (I$^2$-) regularization in UP-ReID, which is instantiated as two constraints operating at the global image level and the local patch level.
Our UP-ReID pre-trained model can significantly benefit the downstream ReID fine-tuning and achieve state-of-the-art performance.
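The two constraints suggest a contrastive term on whole-image embeddings plus one on patch embeddings; the generic InfoNCE form below sketches that structure under stated assumptions, not UP-ReID's exact instantiation.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    """Generic InfoNCE between two batches of paired embeddings."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

# Mock embeddings for two augmented views of the same pedestrian images.
g1, g2 = torch.randn(8, 256), torch.randn(8, 256)        # global features
p1, p2 = torch.randn(8, 4, 256), torch.randn(8, 4, 256)  # per-patch features

global_loss = info_nce(g1, g2)
# Local constraint: contrast corresponding patches across the two views
# (an assumed instantiation of the patch-level term).
local_loss = info_nce(p1.flatten(0, 1), p2.flatten(0, 1))
loss = global_loss + local_loss
```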
arXiv Detail & Related papers (2021-12-01T07:16:37Z)
- Semantic-Aware Generation for Self-Supervised Visual Representation Learning [116.5814634936371]
We advocate Semantic-aware Generation (SaGe), which encourages richer semantics, rather than low-level details, to be preserved in the generated image.
SaGe complements the target network with view-specific features and thus alleviates the semantic degradation brought by intensive data augmentations.
We execute SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks including nearest neighbor test, linear classification, and fine-scaled image recognition.
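A toy reading of complementing the target network with view-specific features: a generator reconstructs the augmented view from semantic features plus a view-specific code, so augmentation detail need not be carried by the semantic branch. All modules below are mocked stand-ins, not SaGe's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Conv2d(3, 64, 3, padding=1)       # semantic features (target net)
view_branch = nn.Conv2d(3, 64, 3, padding=1)   # view-specific features
generator = nn.Conv2d(128, 3, 3, padding=1)    # reconstructs the view

augmented = torch.randn(4, 3, 32, 32)          # heavily augmented input
z_sem = encoder(augmented)
z_view = view_branch(augmented)
recon = generator(torch.cat([z_sem, z_view], dim=1))
# Reconstructing the view from semantics + view code, so the semantic
# features are not forced to memorize augmentation-specific detail.
loss = F.mse_loss(recon, augmented)
loss.backward()
```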
arXiv Detail & Related papers (2021-11-25T16:46:13Z)
- The Role of Pre-Training in High-Resolution Remote Sensing Scene Classification [0.0]
We show that training models from scratch on newer datasets yields comparable results to fine-tuning the models pre-trained on ImageNet.
In many cases the best representations are obtained with a second round of pre-training on in-domain data.
arXiv Detail & Related papers (2021-11-05T18:30:54Z)
- Unsupervised Pre-training for Person Re-identification [90.98552221699508]
We present "LUPerson", a large-scale unlabeled person re-identification (Re-ID) dataset.
We make the first attempt at unsupervised pre-training for improving the generalization ability of the learned person Re-ID feature representation.
arXiv Detail & Related papers (2020-12-07T14:48:26Z)
- VirTex: Learning Visual Representations from Textual Annotations [25.104705278771895]
VirTex is a pretraining approach using semantically dense captions to learn visual representations.
We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks.
On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised.
arXiv Detail & Related papers (2020-06-11T17:58:48Z)
- RGB-based Semantic Segmentation Using Self-Supervised Depth Pre-Training [77.62171090230986]
We propose an easily scalable and self-supervised technique that can be used to pre-train any semantic RGB segmentation method.
In particular, our pre-training approach makes use of automatically generated labels that can be obtained using depth sensors.
We show how our proposed self-supervised pre-training with HN-labels can be used to replace ImageNet pre-training.
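In sketch form, replacing ImageNet pretraining with depth-derived supervision might look like the snippet below; the quantized-depth targets stand in for the paper's HN-labels, whose exact construction is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Conv2d(3, 8, 3, padding=1)              # mock segmentation network

rgb = torch.randn(4, 3, 64, 64)                  # RGB input
depth = torch.rand(4, 64, 64)                    # mock aligned depth map
pseudo_labels = (depth * 8).long().clamp(max=7)  # 8 quantized depth bins

logits = net(rgb)
loss = F.cross_entropy(logits, pseudo_labels)    # pretrain on free labels
loss.backward()
```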
arXiv Detail & Related papers (2020-02-06T11:16:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.