POAR: Towards Open Vocabulary Pedestrian Attribute Recognition
- URL: http://arxiv.org/abs/2303.14643v2
- Date: Mon, 7 Aug 2023 14:08:44 GMT
- Title: POAR: Towards Open Vocabulary Pedestrian Attribute Recognition
- Authors: Yue Zhang, Suchen Wang, Shichao Kan, Zhenyu Weng, Yigang Cen, Yap-peng Tan
- Abstract summary: Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian in a surveillance system.
It is impossible to exhaust all pedestrian attributes in the real world.
We develop a novel pedestrian open-attribute recognition framework.
- Score: 39.399286703315745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian attribute recognition (PAR) aims to predict the attributes of a
target pedestrian in a surveillance system. Existing methods address the PAR
problem by training a multi-label classifier with predefined attribute classes.
However, it is impossible to exhaust all pedestrian attributes in the real
world. To tackle this problem, we develop a novel pedestrian open-attribute
recognition (POAR) framework. Our key idea is to formulate the POAR problem as
an image-text search problem. We design a Transformer-based image encoder with
a masking strategy. A set of attribute tokens is introduced to focus on
specific pedestrian parts (e.g., head, upper body, lower body, feet, etc.) and
encode corresponding attributes into visual embeddings. Each attribute category
is described as a natural language sentence and encoded by the text encoder.
Then, we compute the similarity between the visual and text embeddings of
attributes to find the best attribute descriptions for the input images.
Unlike existing methods that learn a specific classifier for each attribute
category, we model the pedestrian at the part level and exploit the search
formulation to handle unseen attributes. Finally, a many-to-many
contrastive (MTMC) loss with masked tokens is proposed to train the network
since a pedestrian image can comprise multiple attributes. Extensive
experiments conducted on benchmark PAR datasets under an open-attribute
setting verify the effectiveness of the proposed POAR method, which forms a
strong baseline for the POAR task. Our code is
available at \url{https://github.com/IvyYZ/POAR}.
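As a rough illustration of the image-text search formulation described in the abstract (this is not the authors' released code; the patch embedder, part-token setup, attention masking, and dimensions below are assumptions), a part-token encoder paired with cosine-similarity retrieval over attribute-sentence embeddings might look like this in PyTorch:

```python
# Hypothetical sketch of part-token encoding + attribute retrieval, not the official POAR code.
import torch
import torch.nn.functional as F


class PartTokenImageEncoder(torch.nn.Module):
    """ViT-style encoder with learnable part tokens (e.g., head, upper body,
    lower body, feet) that gather evidence from image patch features."""

    def __init__(self, patch_embedder, num_parts=4, dim=512, num_heads=8):
        super().__init__()
        self.patch_embedder = patch_embedder              # assumed to map images -> (B, N, dim) patch features
        self.part_tokens = torch.nn.Parameter(torch.randn(num_parts, dim))
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, images, part_masks=None):
        patches = self.patch_embedder(images)             # (B, N, dim)
        parts = self.part_tokens.expand(images.size(0), -1, -1)
        # POAR's masking strategy would restrict each part token to its own
        # image region; `part_masks` is an assumed (P, N) boolean mask standing
        # in for that step, not the paper's exact scheme.
        out, _ = self.attn(parts, patches, patches, attn_mask=part_masks)
        return out                                        # (B, num_parts, dim)


def retrieve_attributes(part_embs, text_embs):
    """Rank attribute sentences for each part token by cosine similarity."""
    part_embs = F.normalize(part_embs, dim=-1)            # (B, P, D)
    text_embs = F.normalize(text_embs, dim=-1)            # (A, D), one row per attribute sentence
    sims = part_embs @ text_embs.t()                      # (B, P, A)
    return sims.argmax(dim=-1)                            # best-matching attribute per part
```

Under this formulation, a new attribute can be handled at inference by encoding its natural-language description with the text encoder and appending the embedding to `text_embs`; no classifier needs to be retrained.

The many-to-many contrastive (MTMC) loss reflects the fact that one pedestrian image matches several attribute sentences at once. A minimal multi-positive contrastive sketch in that spirit (the temperature value and the handling of masked part tokens are assumptions, not taken from the paper) could be:

```python
# Sketch of a many-to-many contrastive objective; details are assumed, not the paper's code.
import torch
import torch.nn.functional as F


def many_to_many_contrastive(img_embs, text_embs, pos_mask, temperature=0.07):
    """
    img_embs:  (B, D) visual embeddings (e.g., one part token per sample)
    text_embs: (A, D) embeddings of all attribute sentences
    pos_mask:  (B, A) with 1 where the attribute is present in the image, else 0
    """
    img_embs = F.normalize(img_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = img_embs @ text_embs.t() / temperature        # (B, A)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over every positive attribute of each image,
    # so all ground-truth attributes act as positives simultaneously.
    pos_per_img = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_per_img
    return loss.mean()
```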
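Both sketches above are written against standard PyTorch APIs only; any resemblance to the repository at https://github.com/IvyYZ/POAR beyond the general formulation is coincidental.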
Related papers
- Adaptive Prototype Model for Attribute-based Multi-label Few-shot Action Recognition [11.316708754749103]
In real-world action recognition systems, incorporating more attributes helps achieve a more comprehensive understanding of human behavior.
We propose a novel method, the Adaptive Attribute Prototype Model (AAPM), for human action recognition, which captures rich action-relevant attribute information.
Our AAPM achieves state-of-the-art performance in both attribute-based multi-label few-shot action recognition and single-label few-shot action recognition.
arXiv Detail & Related papers (2025-02-18T06:39:28Z) - ATPrompt: Textual Prompt Learning with Embedded Attributes [73.1352833091256]
We introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt.
We transform the text prompt from a category-centric form to an attribute-category hybrid form.
As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format.
arXiv Detail & Related papers (2024-12-12T16:57:20Z) - Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search [19.610244285078483]
We propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images.
We show that our proposed method significantly surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2024-06-06T03:34:42Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, requiring no data format other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate the average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z) - SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm [18.53048511206039]
We propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR.
It extracts the pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts.
The masked multi-head attention layer is introduced into the decoder module to prevent the model from remembering the next attribute while making attribute predictions during training.
arXiv Detail & Related papers (2023-12-04T05:42:56Z) - Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes interacting with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z) - Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to understand human attributes using video frames that can make full use of temporal information.
We formulate the video-based PAR as a vision-language fusion problem and adopt pre-trained big models CLIP to extract the feature embeddings of given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z) - Label2Label: A Language Modeling Framework for Multi-Attribute Learning [93.68058298766739]
Label2Label is the first attempt at multi-attribute prediction from the perspective of language modeling.
Inspired by the success of pre-training language models in NLP, Label2Label introduces an image-conditioned masked language model.
Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints.
arXiv Detail & Related papers (2022-07-18T15:12:33Z)