POAR: Towards Open Vocabulary Pedestrian Attribute Recognition
- URL: http://arxiv.org/abs/2303.14643v2
- Date: Mon, 7 Aug 2023 14:08:44 GMT
- Title: POAR: Towards Open Vocabulary Pedestrian Attribute Recognition
- Authors: Yue Zhang, Suchen Wang, Shichao Kan, Zhenyu Weng, Yigang Cen, Yap-peng Tan
- Abstract summary: Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian in a surveillance system.
It is impossible to exhaust all pedestrian attributes in the real world.
We develop a novel pedestrian open-attribute recognition framework.
- Score: 39.399286703315745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian attribute recognition (PAR) aims to predict the attributes of a
target pedestrian in a surveillance system. Existing methods address the PAR
problem by training a multi-label classifier with predefined attribute classes.
However, it is impossible to exhaust all pedestrian attributes in the real
world. To tackle this problem, we develop a novel pedestrian open-attribute
recognition (POAR) framework. Our key idea is to formulate the POAR problem as
an image-text search problem. We design a Transformer-based image encoder with
a masking strategy. A set of attribute tokens are introduced to focus on
specific pedestrian parts (e.g., head, upper body, lower body, feet, etc.) and
encode corresponding attributes into visual embeddings. Each attribute category
is described as a natural language sentence and encoded by the text encoder.
Then, we compute the similarity between the visual and text embeddings of
attributes to find the best attribute descriptions for the input images.
Different from existing methods that learn a specific classifier for each
attribute category, we model the pedestrian at the part level and explore a
search-based method to handle unseen attributes. Finally, a many-to-many
contrastive (MTMC) loss with masked tokens is proposed to train the network
since a pedestrian image can comprise multiple attributes. Extensive
experiments have been conducted on benchmark PAR datasets with an
open-attribute setting. The results verified the effectiveness of the proposed
POAR method, which can form a strong baseline for the POAR task. Our code is
available at https://github.com/IvyYZ/POAR.
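To make the formulation concrete, below is a minimal, self-contained PyTorch sketch of the two ideas the abstract describes: part-level attribute tokens whose visual embeddings are matched against text embeddings of attribute sentences, and a simplified many-to-many contrastive (MTMC) objective. This is an illustrative sketch only, not the authors' implementation (see the repository linked above); the toy backbone, the bag-of-words text encoder, and all names such as `PartImageEncoder` and `mtmc_loss` are assumptions, and the paper's masking strategy over attribute tokens is omitted.

```python
# Minimal sketch of POAR-style open-vocabulary attribute search plus a
# simplified many-to-many contrastive (MTMC) loss. Architectural details and
# names here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartImageEncoder(nn.Module):
    """Stand-in image encoder: produces one embedding per body part
    (e.g., head, upper body, lower body, feet) via learnable part queries."""
    def __init__(self, num_parts=4, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(  # toy CNN instead of the paper's ViT
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.part_tokens = nn.Parameter(torch.randn(num_parts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images).unsqueeze(1)  # (B, 1, dim)
        queries = self.part_tokens.unsqueeze(0).expand(images.size(0), -1, -1)
        parts, _ = self.attn(queries, feats, feats)  # (B, num_parts, dim)
        return F.normalize(parts, dim=-1)

def encode_texts(sentences, dim=256):
    """Placeholder text encoder (hashed bag of words). A real system would use
    a pretrained text encoder for sentences like "a pedestrian wearing a hat"."""
    embs = torch.zeros(len(sentences), dim)
    for i, s in enumerate(sentences):
        for tok in s.lower().split():
            embs[i, hash(tok) % dim] += 1.0
    return F.normalize(embs, dim=-1)

def mtmc_loss(part_embs, text_embs, pos_mask, temperature=0.07):
    """Simplified many-to-many contrastive loss: a part embedding may match
    several attribute sentences, so all positives contribute to the numerator.
    pos_mask: (B, P, T) with 1 where text t describes part p of image b."""
    logits = part_embs @ text_embs.t() / temperature            # (B, P, T)
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    per_part = -(pos_mask * log_prob).sum(-1) / pos_mask.sum(-1).clamp(min=1)
    valid = pos_mask.sum(-1).gt(0).float()       # ignore parts with no labels
    return (per_part * valid).sum() / valid.sum().clamp(min=1)

# Open-vocabulary inference: rank arbitrary attribute sentences per body part.
encoder = PartImageEncoder()
images = torch.randn(2, 3, 224, 224)
sentences = ["a pedestrian wearing a hat",
             "a pedestrian wearing a red dress",
             "a pedestrian wearing sneakers"]
with torch.no_grad():
    sims = encoder(images) @ encode_texts(sentences).t()   # (2, 4, 3)
print(sims.argmax(dim=-1))   # best-matching sentence index for each part

# Training step sketch: mark which sentences describe which parts, then
# optimize the MTMC objective.
pos_mask = torch.zeros(2, 4, 3)
pos_mask[0, 0, 0] = 1.0      # image 0, head part <-> "wearing a hat"
pos_mask[1, 3, 2] = 1.0      # image 1, feet part <-> "wearing sneakers"
loss = mtmc_loss(encoder(images), encode_texts(sentences), pos_mask)
loss.backward()
```

Because attributes are retrieved by text similarity rather than by fixed classifier heads, new attribute sentences can be added at inference time without retraining, which is what makes the open-vocabulary setting possible.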
Related papers
- MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive).
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z) - Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search [19.610244285078483]
We propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images.
We show that our proposed method significantly surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2024-06-06T03:34:42Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image.
We propose to understand human attributes using video frames that can make full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, with no need for additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate the average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z) - SequencePAR: Understanding Pedestrian Attributes via A Sequence
Generation Paradigm [18.53048511206039]
We propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR.
It extracts the pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts.
The masked multi-head attention layer is introduced into the decoder module to prevent the model from remembering the next attribute while making attribute predictions during training.
arXiv Detail & Related papers (2023-12-04T05:42:56Z) - Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes that interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z) - Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to understand human attributes using video frames that can make full use of temporal information.
We formulate the video-based PAR as a vision-language fusion problem and adopt pre-trained big models CLIP to extract the feature embeddings of given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z) - Label2Label: A Language Modeling Framework for Multi-Attribute Learning [93.68058298766739]
Label2Label is the first attempt at multi-attribute prediction from the perspective of language modeling.
Inspired by the success of pre-training language models in NLP, Label2Label introduces an image-conditioned masked language model.
Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints (a minimal, hypothetical sketch of this masked-attribute idea follows the list below).
arXiv Detail & Related papers (2022-07-18T15:12:33Z) - SMILE: Semantically-guided Multi-attribute Image and Layout Editing [154.69452301122175]
Attribute image manipulation has been a very active topic since the introduction of Generative Adversarial Networks (GANs).
We present a multimodal representation that handles all attributes, be it guided by random noise or images, while only using the underlying domain information of the target domain.
Our method is capable of adding, removing or changing either fine-grained or coarse attributes by using an image as a reference or by exploring the style distribution space.
arXiv Detail & Related papers (2020-10-05T20:15:21Z)
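Several entries above (e.g., SequencePAR and Label2Label) share the idea of predicting attributes from the image together with partial attribute context, rather than through independent per-attribute classifiers. The sketch below is a generic, hypothetical illustration of that masked-attribute idea; the module names, dimensions, and the assumed 512-d image feature are assumptions for illustration, not details taken from any of the listed papers.

```python
# Illustrative sketch of an image-conditioned masked attribute model in the
# spirit of the masked-prediction intuition described above. Hypothetical
# names and architecture; not the implementation of any paper listed here.
import torch
import torch.nn as nn

class MaskedAttributeModel(nn.Module):
    def __init__(self, num_attrs, dim=128):
        super().__init__()
        # vocabulary: one id per attribute word, plus a [MASK] id at the end
        self.mask_id = num_attrs
        self.embed = nn.Embedding(num_attrs + 1, dim)
        self.img_proj = nn.Linear(512, dim)    # assumes 512-d image features
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_attrs)  # predict the original attribute id

    def forward(self, attr_ids, img_feat):
        # attr_ids: (B, L) attribute ids with some positions set to mask_id
        # img_feat: (B, 512) global image feature prepended as a context token
        tokens = torch.cat([self.img_proj(img_feat).unsqueeze(1),
                            self.embed(attr_ids)], dim=1)
        out = self.encoder(tokens)[:, 1:]      # drop the image token
        return self.head(out)                  # (B, L, num_attrs)

# toy usage: mask one attribute slot and train to recover it from the image
# and the remaining attribute hints
model = MaskedAttributeModel(num_attrs=10)
attrs = torch.randint(0, 10, (2, 5))           # ground-truth attribute ids
masked = attrs.clone()
masked[:, 2] = model.mask_id                   # hide one attribute slot
logits = model(masked, torch.randn(2, 512))
loss = nn.functional.cross_entropy(logits[:, 2], attrs[:, 2])
loss.backward()
```

Masking attribute slots during training forces the model to rely on both the image and the remaining attribute hints, which is the stated intuition for capturing instance-wise attribute relations.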