FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition
- URL: http://arxiv.org/abs/2506.22836v1
- Date: Sat, 28 Jun 2025 10:38:54 GMT
- Title: FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition
- Authors: Hongyan An, Kuan Zhu, Xin He, Haiyun Guo, Chaoyang Zhao, Ming Tang, Jinqiao Wang
- Abstract summary: Pedestrian attribute recognition is a fundamental perception task in intelligent transportation and security. To tackle this fine-grained task, most existing methods focus on extracting regional features to enrich attribute information. We propose the Fine-grained Optimization with semantiC gUided underStanding (FOCUS) approach for PAR.
- Score: 40.85042685914472
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Pedestrian attribute recognition (PAR) is a fundamental perception task in intelligent transportation and security. To tackle this fine-grained task, most existing methods focus on extracting regional features to enrich attribute information. However, a regional feature is typically used to predict a fixed set of pre-defined attributes in these methods, which limits the performance and practicality in two aspects: 1) Regional features may compromise fine-grained patterns unique to certain attributes in favor of capturing common characteristics shared across attributes. 2) Regional features cannot generalize to predict unseen attributes at test time. In this paper, we propose the \textbf{F}ine-grained \textbf{O}ptimization with semanti\textbf{C} g\textbf{U}ided under\textbf{S}tanding (FOCUS) approach for PAR, which adaptively extracts fine-grained attribute-level features for each attribute individually, regardless of whether the attributes are seen or not during training. Specifically, we propose the Multi-Granularity Mix Tokens (MGMT) to capture latent features at varying levels of visual granularity, thereby enriching the diversity of the extracted information. Next, we introduce the Attribute-guided Visual Feature Extraction (AVFE) module, which leverages textual attributes as queries to retrieve their corresponding visual attribute features from the Mix Tokens using a cross-attention mechanism. To ensure that textual attributes focus on the appropriate Mix Tokens, we further incorporate a Region-Aware Contrastive Learning (RACL) method, encouraging attributes within the same region to share consistent attention maps. Extensive experiments on PA100K, PETA, and RAPv1 datasets demonstrate the effectiveness and strong generalization ability of our method.
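To make the pipeline described in the abstract concrete, here is a minimal PyTorch sketch of an AVFE-style cross-attention step (text attribute embeddings as queries over visual Mix Tokens) and a region-aware regularizer in the spirit of RACL. All module names, dimensions, and the exact loss form are illustrative assumptions based only on the abstract, not the authors' implementation.

```python
# Illustrative sketch based only on the abstract above; names, shapes, and the
# loss form are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeGuidedExtraction(nn.Module):
    """AVFE-style step: textual attribute embeddings act as queries that
    retrieve visual attribute features from the Mix Tokens via cross-attention."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, attr_emb: torch.Tensor, mix_tokens: torch.Tensor):
        # attr_emb:   (B, A, D) one text embedding per attribute (seen or unseen)
        # mix_tokens: (B, T, D) multi-granularity visual tokens (MGMT output)
        attr_feat, attn = self.cross_attn(
            query=attr_emb, key=mix_tokens, value=mix_tokens,
            need_weights=True, average_attn_weights=True)
        return attr_feat, attn  # attn: (B, A, T) attention over the Mix Tokens

def region_aware_loss(attn: torch.Tensor, region_ids: torch.Tensor, tau: float = 0.1):
    """RACL-inspired regularizer: pull together the attention maps of
    attributes that belong to the same body region (the paper's exact
    contrastive formulation may differ)."""
    B, A, T = attn.shape
    maps = F.normalize(attn, dim=-1)                       # unit-norm attention maps
    sim = torch.einsum("bat,bct->bac", maps, maps) / tau   # (B, A, A) map similarities
    same = (region_ids[:, None] == region_ids[None, :]).float()
    same.fill_diagonal_(0)                                 # exclude self-pairs
    log_prob = sim.log_softmax(dim=-1)
    pos = (log_prob * same).sum(-1) / same.sum(-1).clamp(min=1)
    return -pos.mean()

# Toy usage with made-up sizes (e.g., 26 attribute prompts, 16 Mix Tokens):
avfe = AttributeGuidedExtraction()
attrs = torch.randn(2, 26, 512)
tokens = torch.randn(2, 16, 512)
feats, attn = avfe(attrs, tokens)
loss = region_aware_loss(attn, torch.randint(0, 5, (26,)))
```

Because the queries are text embeddings rather than fixed regional heads, the same module can in principle be queried with attribute prompts unseen during training, which is the generalization property the abstract emphasizes.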
Related papers
- ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition [8.982938200941092]
Pedestrian Attribute Recognition (PAR) aims to identify detailed attributes of an individual, such as clothing, accessories, and gender.
ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference.
arXiv Detail & Related papers (2025-06-02T08:07:06Z)
- LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [63.07563443280147]
We propose a novel framework named LATex for AG-ReID.
It adopts prompt-tuning strategies to leverage attribute-based text knowledge.
Our framework can fully exploit this knowledge to improve AG-ReID performance.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
- A Solution to Co-occurrence Bias: Attributes Disentanglement via Mutual Information Minimization for Pedestrian Attribute Recognition [10.821982414387525]
We show that current methods can struggle to generalize such fitted attribute interdependencies to scenes or identities outside the dataset distribution.
To render models robust in realistic scenes, we propose attributes-disentangled feature learning to ensure that the recognition of one attribute does not rely on the existence of others.
arXiv Detail & Related papers (2023-07-28T01:34:55Z)
- ASD: Towards Attribute Spatial Decomposition for Prior-Free Facial Attribute Recognition [11.757112726108822]
Representing the spatial properties of facial attributes is a vital challenge for facial attribute recognition (FAR).
Recent advances have achieved reliable performance for FAR, benefiting from the description of spatial properties via extra prior information.
We propose a prior-free method for attribute spatial decomposition (ASD), mitigating the spatial ambiguity of facial attributes without any extra prior information.
arXiv Detail & Related papers (2022-10-25T02:25:05Z)
- TransFA: Transformer-based Representation for Face Attribute Evaluation [87.09529826340304]
We propose TransFA, a novel transformer-based representation for face attribute evaluation.
The proposed TransFA achieves superior performance compared with state-of-the-art methods.
arXiv Detail & Related papers (2022-07-12T10:58:06Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Spatial and Semantic Consistency Regularizations for Pedestrian Attribute Recognition [50.932864767867365]
We propose a framework that consists of two complementary regularizations to achieve spatial and semantic consistency for each attribute.
Based on the precise attribute locations, we propose a semantic consistency regularization to extract intrinsic and discriminative semantic features.
Results show that the proposed method performs favorably against state-of-the-art methods without increasing parameters.
arXiv Detail & Related papers (2021-09-13T03:36:44Z)
- Pedestrian Attribute Recognition in Video Surveillance Scenarios Based on View-attribute Attention Localization [8.807717261983539]
We propose a novel view-attribute localization method based on attention (VALA).
A specific view-attribute is composed of the extracted attribute feature and four view scores, predicted by a view predictor as the confidences for the attribute from different views.
Experiments on four datasets (RAP, RAPv2, PETA, and PA-100K) demonstrate the effectiveness of our approach compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-06-11T16:09:31Z)
- Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition [27.0842107128122]
We devise an attributes-guided attention module (AGAM) to utilize human-annotated attributes and learn more discriminative features.
Our proposed module can significantly improve simple metric-based approaches to achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-09-10T08:38:32Z)
- Attribute Mix: Semantic Data Augmentation for Fine Grained Recognition [102.45926816660665]
We propose Attribute Mix, a data augmentation strategy at the attribute level to expand fine-grained training samples.
The principle is that attribute features are shared among fine-grained sub-categories and can be seamlessly transferred among images (a rough sketch follows this entry).
arXiv Detail & Related papers (2020-04-06T14:06:47Z)
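As a rough illustration of the attribute-transfer principle summarized in the Attribute Mix entry above, the following CutMix-style sketch pastes an attribute region from one image into another and mixes the labels proportionally. The region selection, blending weight, and label handling here are simplified assumptions for illustration, not the paper's exact procedure.

```python
# CutMix-style illustration of mixing a shared attribute region between two
# images; simplified assumptions, not the Attribute Mix paper's procedure.
import torch

def attribute_mix(img_a, img_b, box, label_a, label_b, alpha: float = 0.5):
    # img_a, img_b: (C, H, W) tensors; box: (y1, y2, x1, x2) region assumed to
    # contain an attribute shared across sub-categories (e.g., a wing pattern)
    y1, y2, x1, x2 = box
    mixed = img_a.clone()
    mixed[:, y1:y2, x1:x2] = (
        alpha * img_b[:, y1:y2, x1:x2] + (1 - alpha) * img_a[:, y1:y2, x1:x2])
    # Mix labels in proportion to how much of img_a was replaced
    _, H, W = img_a.shape
    lam = 1 - alpha * ((y2 - y1) * (x2 - x1) / (H * W))
    return mixed, lam * label_a + (1 - lam) * label_b
```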
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.