Learning Transferable Pedestrian Representation from Multimodal
Information Supervision
- URL: http://arxiv.org/abs/2304.05554v1
- Date: Wed, 12 Apr 2023 01:20:58 GMT
- Title: Learning Transferable Pedestrian Representation from Multimodal
Information Supervision
- Authors: Liping Bao, Longhui Wei, Xiaoyu Qiu, Wengang Zhou, Houqiang Li, Qi
Tian
- Abstract summary: VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image has text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
- Score: 174.5150760804929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research on unsupervised person re-identification (reID) has
demonstrated that pre-training on unlabeled person images achieves better
performance on downstream reID tasks than pre-training on ImageNet. However,
those pre-training methods are specifically designed for reID and adapt
poorly to other pedestrian analysis tasks. In this paper, we propose
VAL-PAT, a novel framework that learns transferable representations to enhance
various pedestrian analysis tasks with multimodal information. To train our
framework, we introduce three learning objectives, i.e., self-supervised
contrastive learning, image-text contrastive learning and multi-attribute
classification. The self-supervised contrastive learning facilitates the
learning of the intrinsic pedestrian properties, while the image-text
contrastive learning guides the model to focus on the appearance information of
pedestrians. Meanwhile, multi-attribute classification encourages the model to
recognize attributes to extract fine-grained pedestrian information. We first
perform pre-training on the LUPerson-TA dataset, where each image contains text and
attribute annotations, and then transfer the learned representations to various
downstream tasks, including person reID, person attribute recognition and
text-based person search. Extensive experiments demonstrate that our framework
facilitates the learning of general pedestrian representations and thus leads
to promising results on various pedestrian analysis tasks.
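The three objectives named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulas, temperature, or weighting, so the InfoNCE form, the binary cross-entropy form, the equal weights, and every function name below (`info_nce`, `multi_attribute_bce`, `pretraining_loss`) are assumptions chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """InfoNCE: candidates[i] is the positive for anchors[i]; other rows are negatives."""
    a = [normalize(v) for v in anchors]
    c = [normalize(v) for v in candidates]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, cj)) / temperature for cj in c]
        peak = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_info_nce(xs, ys, temperature=0.1):
    """Average the loss over both matching directions (x -> y and y -> x)."""
    return 0.5 * (info_nce(xs, ys, temperature) + info_nce(ys, xs, temperature))


def multi_attribute_bce(logits, targets):
    """Mean binary cross-entropy, treating every attribute as an independent label."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def pretraining_loss(view_a, view_b, text, attr_logits, attr_targets,
                     weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives; equal weights are an assumption."""
    parts = {
        "self_supervised": symmetric_info_nce(view_a, view_b),  # two augmented views
        "image_text": symmetric_info_nce(view_a, text),         # image vs. caption
        "attribute": multi_attribute_bce(attr_logits, attr_targets),
    }
    total = sum(w * parts[k] for w, k in zip(weights, parts))
    return total, parts


if __name__ == "__main__":
    # Toy batch of two pedestrians with hand-written 3-d embeddings (made-up values).
    view_a = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
    view_b = [[0.9, 0.0, 0.1], [0.1, 1.0, 0.0]]
    text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    attr_logits = [[2.0, -1.0], [-0.5, 1.5]]
    attr_targets = [[1, 0], [0, 1]]
    total, parts = pretraining_loss(view_a, view_b, text, attr_logits, attr_targets)
    print(total, parts)
```

In a real pre-training setup the embeddings would come from image and text encoders and the attribute logits from a classification head; here they are plain lists so that only the way the three terms combine is visible.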
Related papers
- Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models [0.65268245109828]
We introduce the notion of contextual diversity for active learning (CDAL).
We propose a data repair algorithm to curate contextually fair data to reduce model bias.
We are working on developing an image retrieval system for wildlife camera trap images and a reliable warning system for poor-quality rural roads.
arXiv Detail & Related papers (2024-11-04T09:43:33Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z)
- Self-Supervised Visual Representation Learning Using Lightweight Architectures [0.0]
In self-supervised learning, a model is trained to solve a pretext task, using a data set whose annotations are created by a machine.
We critically examine the most notable pretext tasks to extract features from image data.
We study the performance of various self-supervised techniques keeping all other parameters uniform.
arXiv Detail & Related papers (2021-10-21T14:13:10Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Semi-supervised Learning with a Teacher-student Network for Generalized Attribute Prediction [7.462336024223667]
This paper presents a study on semi-supervised learning to solve the visual attribute prediction problem.
Our method achieves competitive performance on various benchmarks for fashion attribute prediction.
arXiv Detail & Related papers (2020-07-14T02:06:24Z)