Learning Transferable Pedestrian Representation from Multimodal
Information Supervision
- URL: http://arxiv.org/abs/2304.05554v1
- Date: Wed, 12 Apr 2023 01:20:58 GMT
- Title: Learning Transferable Pedestrian Representation from Multimodal
Information Supervision
- Authors: Liping Bao, Longhui Wei, Xiaoyu Qiu, Wengang Zhou, Houqiang Li, Qi
Tian
- Abstract summary: VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
- Score: 174.5150760804929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research on unsupervised person re-identification (reID) has
demonstrated that pre-training on unlabeled person images achieves better
performance on downstream reID tasks than pre-training on ImageNet. However,
those pre-training methods are specifically designed for reID and do not adapt
flexibly to other pedestrian analysis tasks. In this paper, we propose
VAL-PAT, a novel framework that learns transferable representations to enhance
various pedestrian analysis tasks with multimodal information. To train our
framework, we introduce three learning objectives, i.e., self-supervised
contrastive learning, image-text contrastive learning, and multi-attribute
classification. The self-supervised contrastive learning facilitates the
learning of the intrinsic pedestrian properties, while the image-text
contrastive learning guides the model to focus on the appearance information of
pedestrians. Meanwhile, multi-attribute classification encourages the model to
recognize attributes and thus extract fine-grained pedestrian information. We first
perform pre-training on the LUPerson-TA dataset, where each image contains text and
attribute annotations, and then transfer the learned representations to various
downstream tasks, including person reID, person attribute recognition and
text-based person search. Extensive experiments demonstrate that our framework
facilitates the learning of general pedestrian representations and thus leads
to promising results on various pedestrian analysis tasks.
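The abstract gives only a high-level description of the three objectives. As a rough, hypothetical sketch (not the authors' released code), they could be combined as a weighted multi-task loss in PyTorch; the function names, loss weights, and the InfoNCE formulation below are assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    """Generic InfoNCE loss: matched pairs (q[i], k[i]) are positives,
    every other pair in the batch serves as a negative."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                      # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)    # diagonal = positives
    return F.cross_entropy(logits, targets)

def val_pat_loss(img_emb, img_emb_aug, txt_emb, attr_logits, attr_labels,
                 w_ssl=1.0, w_itc=1.0, w_attr=1.0):
    # 1) Self-supervised contrastive: two augmented views of the same image.
    loss_ssl = info_nce(img_emb, img_emb_aug)
    # 2) Image-text contrastive: align pedestrian images with their captions.
    loss_itc = info_nce(img_emb, txt_emb)
    # 3) Multi-attribute classification: attributes are independent binary
    #    labels, so a multi-label BCE loss fits (attr_labels must be float).
    loss_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_labels)
    return w_ssl * loss_ssl + w_itc * loss_itc + w_attr * loss_attr
```

In practice, image-text contrastive losses of this kind are usually applied symmetrically (image-to-text and text-to-image), and the relative weights of the three terms would be tuned on validation data.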
Related papers
- Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models [0.65268245109828]
We introduce the notion of contextual diversity for active learning (CDAL).
We propose a data repair algorithm to curate contextually fair data to reduce model bias.
We are developing an image retrieval system for wildlife camera-trap images and a reliable warning system for poor-quality rural roads.
arXiv Detail & Related papers (2024-11-04T09:43:33Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Heuristic Vision Pre-Training with Self-Supervised and Supervised
Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z) - PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z) - Self-Supervised Visual Representation Learning Using Lightweight
Architectures [0.0]
In self-supervised learning, a model is trained to solve a pretext task, using a dataset whose annotations are generated automatically.
We critically examine the most notable pretext tasks to extract features from image data.
We study the performance of various self-supervised techniques keeping all other parameters uniform.
arXiv Detail & Related papers (2021-10-21T14:13:10Z) - Align before Fuse: Vision and Language Representation Learning with
Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z) - Semi-supervised Learning with a Teacher-student Network for Generalized
Attribute Prediction [7.462336024223667]
This paper presents a study on semi-supervised learning to solve the visual attribute prediction problem.
Our method achieves competitive performance on various benchmarks for fashion attribute prediction.
arXiv Detail & Related papers (2020-07-14T02:06:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.