POAR: Towards Open Vocabulary Pedestrian Attribute Recognition
- URL: http://arxiv.org/abs/2303.14643v2
- Date: Mon, 7 Aug 2023 14:08:44 GMT
- Title: POAR: Towards Open Vocabulary Pedestrian Attribute Recognition
- Authors: Yue Zhang, Suchen Wang, Shichao Kan, Zhenyu Weng, Yigang Cen, Yap-peng Tan
- Abstract summary: Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian in a surveillance system.
It is impossible to exhaust all pedestrian attributes in the real world.
We develop a novel pedestrian open-attribute recognition framework.
- Score: 39.399286703315745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian attribute recognition (PAR) aims to predict the attributes of a
target pedestrian in a surveillance system. Existing methods address the PAR
problem by training a multi-label classifier with predefined attribute classes.
However, it is impossible to exhaust all pedestrian attributes in the real
world. To tackle this problem, we develop a novel pedestrian open-attribute
recognition (POAR) framework. Our key idea is to formulate the POAR problem as
an image-text search problem. We design a Transformer-based image encoder with
a masking strategy. A set of attribute tokens are introduced to focus on
specific pedestrian parts (e.g., head, upper body, lower body, feet, etc.) and
encode corresponding attributes into visual embeddings. Each attribute category
is described as a natural language sentence and encoded by the text encoder.
Then, we compute the similarity between the visual and text embeddings of
attributes to find the best attribute descriptions for the input images.
Different from existing methods that learn a specific classifier for each
attribute category, we model the pedestrian at the part level and explore a
search-based method to handle unseen attributes. Finally, a many-to-many
contrastive (MTMC) loss with masked tokens is proposed to train the network
since a pedestrian image can comprise multiple attributes. Extensive
experiments have been conducted on benchmark PAR datasets with an
open-attribute setting. The results verified the effectiveness of the proposed
POAR method, which can form a strong baseline for the POAR task. Our code is
available at https://github.com/IvyYZ/POAR.
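To make the formulation concrete, below is a minimal, self-contained PyTorch sketch of the two ideas the abstract describes: part-level attribute tokens whose visual embeddings are matched against text embeddings of attribute sentences, and a simplified many-to-many contrastive (MTMC) objective. This is an illustrative sketch only, not the authors' implementation (see the repository linked above); the toy backbone, the bag-of-words text encoder, and all names such as `PartImageEncoder` and `mtmc_loss` are assumptions, and the paper's masking strategy over attribute tokens is omitted.

```python
# Minimal sketch of POAR-style open-vocabulary attribute search plus a
# simplified many-to-many contrastive (MTMC) loss. Architectural details and
# names here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartImageEncoder(nn.Module):
    """Stand-in image encoder: produces one embedding per body part
    (e.g., head, upper body, lower body, feet) via learnable part queries."""
    def __init__(self, num_parts=4, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(  # toy CNN instead of the paper's ViT
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.part_tokens = nn.Parameter(torch.randn(num_parts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images).unsqueeze(1)  # (B, 1, dim)
        queries = self.part_tokens.unsqueeze(0).expand(images.size(0), -1, -1)
        parts, _ = self.attn(queries, feats, feats)  # (B, num_parts, dim)
        return F.normalize(parts, dim=-1)

def encode_texts(sentences, dim=256):
    """Placeholder text encoder (hashed bag of words). A real system would use
    a pretrained text encoder for sentences like "a pedestrian wearing a hat"."""
    embs = torch.zeros(len(sentences), dim)
    for i, s in enumerate(sentences):
        for tok in s.lower().split():
            embs[i, hash(tok) % dim] += 1.0
    return F.normalize(embs, dim=-1)

def mtmc_loss(part_embs, text_embs, pos_mask, temperature=0.07):
    """Simplified many-to-many contrastive loss: a part embedding may match
    several attribute sentences, so all positives contribute to the numerator.
    pos_mask: (B, P, T) with 1 where text t describes part p of image b."""
    logits = part_embs @ text_embs.t() / temperature            # (B, P, T)
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    per_part = -(pos_mask * log_prob).sum(-1) / pos_mask.sum(-1).clamp(min=1)
    valid = pos_mask.sum(-1).gt(0).float()       # ignore parts with no labels
    return (per_part * valid).sum() / valid.sum().clamp(min=1)

# Open-vocabulary inference: rank arbitrary attribute sentences per body part.
encoder = PartImageEncoder()
images = torch.randn(2, 3, 224, 224)
sentences = ["a pedestrian wearing a hat",
             "a pedestrian wearing a red dress",
             "a pedestrian wearing sneakers"]
with torch.no_grad():
    sims = encoder(images) @ encode_texts(sentences).t()   # (2, 4, 3)
print(sims.argmax(dim=-1))   # best-matching sentence index for each part

# Training step sketch: mark which sentences describe which parts, then
# optimize the MTMC objective.
pos_mask = torch.zeros(2, 4, 3)
pos_mask[0, 0, 0] = 1.0      # image 0, head part <-> "wearing a hat"
pos_mask[1, 3, 2] = 1.0      # image 1, feet part <-> "wearing sneakers"
loss = mtmc_loss(encoder(images), encode_texts(sentences), pos_mask)
loss.backward()
```

Because attributes are retrieved by text similarity rather than by fixed classifier heads, new attribute sentences can be added at inference time without retraining, which is what makes the open-vocabulary setting possible.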
Related papers
- MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive).
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z) - Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search [19.610244285078483]
We propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images.
We show that our proposed method significantly surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2024-06-06T03:34:42Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image.
We propose to understand human attributes using video frames that can make full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, with no need for additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate the average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z) - SequencePAR: Understanding Pedestrian Attributes via A Sequence
Generation Paradigm [18.53048511206039]
We propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR.
It extracts the pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts.
The masked multi-head attention layer is introduced into the decoder module to prevent the model from remembering the next attribute while making attribute predictions during training.
arXiv Detail & Related papers (2023-12-04T05:42:56Z) - Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes that interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z) - Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to understand human attributes using video frames that can make full use of temporal information.
We formulate the video-based PAR as a vision-language fusion problem and adopt pre-trained big models CLIP to extract the feature embeddings of given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z) - Label2Label: A Language Modeling Framework for Multi-Attribute Learning [93.68058298766739]
Label2Label is the first attempt at multi-attribute prediction from the perspective of language modeling.
Inspired by the success of pre-training language models in NLP, Label2Label introduces an image-conditioned masked language model.
Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints (a minimal, hypothetical sketch of this masked-attribute idea follows the list below).
arXiv Detail & Related papers (2022-07-18T15:12:33Z) - SMILE: Semantically-guided Multi-attribute Image and Layout Editing [154.69452301122175]
Attribute image manipulation has been a very active topic since the introduction of Generative Adversarial Networks (GANs).
We present a multimodal representation that handles all attributes, be it guided by random noise or images, while only using the underlying domain information of the target domain.
Our method is capable of adding, removing or changing either fine-grained or coarse attributes by using an image as a reference or by exploring the style distribution space.
arXiv Detail & Related papers (2020-10-05T20:15:21Z)
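Several entries above (e.g., SequencePAR and Label2Label) share the idea of predicting attributes from the image together with partial attribute context, rather than through independent per-attribute classifiers. The sketch below is a generic, hypothetical illustration of that masked-attribute idea; the module names, dimensions, and the assumed 512-d image feature are assumptions for illustration, not details taken from any of the listed papers.

```python
# Illustrative sketch of an image-conditioned masked attribute model in the
# spirit of the masked-prediction intuition described above. Hypothetical
# names and architecture; not the implementation of any paper listed here.
import torch
import torch.nn as nn

class MaskedAttributeModel(nn.Module):
    def __init__(self, num_attrs, dim=128):
        super().__init__()
        # vocabulary: one id per attribute word, plus a [MASK] id at the end
        self.mask_id = num_attrs
        self.embed = nn.Embedding(num_attrs + 1, dim)
        self.img_proj = nn.Linear(512, dim)    # assumes 512-d image features
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_attrs)  # predict the original attribute id

    def forward(self, attr_ids, img_feat):
        # attr_ids: (B, L) attribute ids with some positions set to mask_id
        # img_feat: (B, 512) global image feature prepended as a context token
        tokens = torch.cat([self.img_proj(img_feat).unsqueeze(1),
                            self.embed(attr_ids)], dim=1)
        out = self.encoder(tokens)[:, 1:]      # drop the image token
        return self.head(out)                  # (B, L, num_attrs)

# toy usage: mask one attribute slot and train to recover it from the image
# and the remaining attribute hints
model = MaskedAttributeModel(num_attrs=10)
attrs = torch.randint(0, 10, (2, 5))           # ground-truth attribute ids
masked = attrs.clone()
masked[:, 2] = model.mask_id                   # hide one attribute slot
logits = model(masked, torch.randn(2, 512))
loss = nn.functional.cross_entropy(logits[:, 2], attrs[:, 2])
loss.backward()
```

Masking attribute slots during training forces the model to rely on both the image and the remaining attribute hints, which is the stated intuition for capturing instance-wise attribute relations.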