Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
- URL: http://arxiv.org/abs/2312.10692v1
- Date: Sun, 17 Dec 2023 11:59:14 GMT
- Title: Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
- Authors: Xiao Wang, Jiandong Jin, Chenglong Li, Jin Tang, Cheng Zhang, Wei Wang
- Abstract summary: We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the fine-tuning strategy.
- Score: 24.804554907625594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing pedestrian attribute recognition (PAR) algorithms adopt a
pre-trained CNN (e.g., ResNet) as their backbone network for visual feature
learning, which may yield sub-optimal results because the relations between
pedestrian images and attribute labels are insufficiently exploited. In this paper, we
formulate PAR as a vision-language fusion problem and fully exploit the
relations between pedestrian images and attribute labels. Specifically, the
attribute phrases are first expanded into sentences, and then the pre-trained
vision-language model CLIP is adopted as our backbone for feature embedding of
visual images and attribute descriptions. The contrastive learning objective
connects the vision and language modalities well in the CLIP-based feature
space, and the Transformer layers used in CLIP can capture the long-range
relations between pixels. Then, a multi-modal Transformer is adopted to fuse
the dual features effectively, and a feed-forward network is used to predict the
attributes. To optimize our network efficiently, we propose the region-aware
prompt tuning technique to adjust very few parameters (i.e., only the prompt
vectors and classification heads) and fix both the pre-trained VL model and
multi-modal Transformer. Our proposed PAR algorithm adjusts only 0.75% of the
learnable parameters compared with the full fine-tuning strategy. It also achieves
new state-of-the-art performance under both standard and zero-shot settings for
PAR, including the RAPv1, RAPv2, WIDER, and PA100K datasets, as well as the
zero-shot PETA-ZS and RAP-ZS datasets. The
source code and pre-trained models will be released on
https://github.com/Event-AHU/OpenPAR.
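To make the two tuning ideas in the abstract concrete, here is a minimal, framework-free sketch (not the authors' code): attribute phrases are expanded into full sentence prompts before text encoding, and only a small set of parameters is trained while the backbone stays frozen. The prompt template and parameter counts below are hypothetical, chosen only to illustrate a sub-1% trainable fraction.

```python
# Illustrative sketch of the two efficiency ideas in the abstract.
# The template wording and all parameter counts are hypothetical.

def expand_attribute(phrase: str) -> str:
    """Expand a short attribute phrase into a full sentence prompt
    suitable for a CLIP-style text encoder."""
    return f"A photo of a pedestrian who is {phrase}."

def trainable_fraction(frozen_params: int, tuned_params: int) -> float:
    """Fraction of parameters actually updated when the pre-trained
    backbone and fusion Transformer are frozen and only the prompt
    vectors and classification heads are tuned."""
    return tuned_params / (frozen_params + tuned_params)

# Expand a few attribute phrases into sentence prompts.
prompts = [expand_attribute(p) for p in ("wearing a hat", "carrying a backpack")]

# Hypothetical sizes: a ~150M-parameter frozen backbone plus fusion
# Transformer, and ~1.1M tunable prompt vectors + classification heads,
# giving a trainable fraction well under 1%.
frac = trainable_fraction(frozen_params=150_000_000, tuned_params=1_100_000)
```

With these illustrative counts the trainable fraction comes out below 1%, matching the spirit of the paper's reported 0.75% figure, though the exact numbers here are invented.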
Related papers
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection [49.510604614688745]
We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP.
We note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps.
arXiv Detail & Related papers (2023-11-01T11:39:22Z)
- RLIPv2: Fast Scaling of Relational Language-Image Pre-training [53.21796397618875]
We propose RLIPv2, a fast converging model that enables the relational scaling of pre-training to large-scale pseudo-labelled scene graph data.
Asymmetric Language-Image Fusion (ALIF) facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding.
RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings.
arXiv Detail & Related papers (2023-08-18T07:17:09Z)
- Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to understand human attributes using video frames that can make full use of temporal information.
We formulate the video-based PAR as a vision-language fusion problem and adopt pre-trained big models CLIP to extract the feature embeddings of given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)
- TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [119.43299939907685]
Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones.
Existing attention-based models tend to learn inferior region features in a single image because they rely solely on unidirectional attention.
We propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations.
arXiv Detail & Related papers (2021-12-16T05:49:51Z)
- CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning [34.46948978082648]
ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation.
This paper investigates the potential deficiencies of INP for video captioning and explores the key to generating accurate descriptions.
We introduce Dual Attribute Prediction, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes.
arXiv Detail & Related papers (2021-11-30T06:37:44Z)
- RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition [123.59890802196797]
We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition.
We construct convolutional layers inside a RepMLP during training and merge them into the FC for inference.
By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs.
arXiv Detail & Related papers (2021-05-05T06:17:40Z)
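The re-parameterization idea in the RepMLP entry above rests on the fact that a convolution is a linear map, so a trained convolutional layer can be folded into a fully connected layer's weight matrix for inference. A toy, framework-free sketch of that equivalence (single channel, 'valid' padding, no bias; all function names are our own, not RepMLP's API):

```python
# Toy demonstration that a 2-D convolution can be merged into an FC layer:
# build a weight matrix M so that M @ flatten(image) == flatten(conv(image)).

def conv2d_valid(image, kernel):
    """Naive single-channel 'valid' convolution (cross-correlation,
    as conventionally used in deep learning)."""
    H, W, k = len(image), len(image[0]), len(kernel)
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            s = 0.0
            for a in range(k):
                for b in range(k):
                    s += image[i + a][j + b] * kernel[a][b]
            row.append(s)
        out.append(row)
    return out

def conv_to_fc_matrix(kernel, H, W):
    """Unroll the convolution into an equivalent FC weight matrix
    acting on the flattened H x W input."""
    k = len(kernel)
    out_h, out_w = H - k + 1, W - k + 1
    M = [[0.0] * (H * W) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            r = i * out_w + j
            for a in range(k):
                for b in range(k):
                    M[r][(i + a) * W + (j + b)] = kernel[a][b]
    return M

def fc_apply(M, x_flat):
    """Apply the FC weight matrix to a flattened input."""
    return [sum(w * x for w, x in zip(row, x_flat)) for row in M]
```

For any input, flattening the convolution's output gives exactly `fc_apply(conv_to_fc_matrix(kernel, H, W), flatten(image))`, which is why the conv branches can be discarded at inference time; the real RepMLP additionally handles multiple channels, padding, and bias terms.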
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.