UniPAR: A Unified Framework for Pedestrian Attribute Recognition
- URL: http://arxiv.org/abs/2603.05114v1
- Date: Thu, 05 Mar 2026 12:34:35 GMT
- Title: UniPAR: A Unified Framework for Pedestrian Attribute Recognition
- Authors: Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan, Xiao Wang, ChiaWei Chu, Yu Li
- Abstract summary: We propose UniPAR, a unified Transformer-based framework for Pedestrian Attribute Recognition. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods.
- Score: 14.613498516126498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian Attribute Recognition (PAR) is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR
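The abstract names two mechanisms: a dynamic classification head that lets one model serve datasets with different attribute sets, and a phased fusion encoder that aligns visual features with textual attribute queries via late deep fusion. A minimal sketch of how these could fit together is given below; the class name, the single cross-attention layer, and the attribute counts for DukeMTMC and EventPAR are assumptions (only MSP60K's 57 attributes appear in the abstract), not the paper's actual design.

```python
import torch
import torch.nn as nn

class PhasedFusionSketch(nn.Module):
    """Toy late-fusion stage: learnable textual attribute queries attend over
    visual tokens, then a per-dataset head scores each attribute.
    Attribute counts for DukeMTMC and EventPAR below are hypothetical."""
    def __init__(self, dim=768, dataset_attrs=None):
        super().__init__()
        dataset_attrs = dataset_attrs or {"MSP60K": 57, "DukeMTMC": 23, "EventPAR": 50}
        self.dataset_attrs = dataset_attrs
        self.text_queries = nn.Parameter(torch.randn(max(dataset_attrs.values()), dim))
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # "Dynamic" head: one small classifier per dataset, selected at runtime.
        self.heads = nn.ModuleDict({name: nn.Linear(dim, 1) for name in dataset_attrs})

    def forward(self, visual_tokens: torch.Tensor, dataset: str) -> torch.Tensor:
        n = self.dataset_attrs[dataset]
        q = self.text_queries[:n].unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        fused, _ = self.fuse(q, visual_tokens, visual_tokens)  # late deep fusion
        return self.heads[dataset](fused).squeeze(-1)          # (B, n) attribute logits

model = PhasedFusionSketch()
logits = model(torch.randn(4, 196, 768), dataset="MSP60K")     # -> (4, 57)
```

Routing through a ModuleDict keyed by dataset name is one straightforward way a single model can train jointly on datasets with disjoint attribute vocabularies.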
Related papers
- EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition [54.55914886780534]
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. We introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios.
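Event frames such as those in EPRBench are commonly built by accumulating raw events over fixed time windows. The generic numpy sketch below illustrates that conversion; the window length and two-polarity layout are assumptions, and EPRBench's actual frame construction may differ.

```python
import numpy as np

def events_to_frames(events: np.ndarray, height: int, width: int,
                     window_us: int = 33_000) -> np.ndarray:
    """Accumulate raw events (t_us, x, y, polarity in {0,1}) into
    2-channel count frames, one frame per time window."""
    t0, t1 = events[:, 0].min(), events[:, 0].max()
    n_frames = int((t1 - t0) // window_us) + 1
    frames = np.zeros((n_frames, 2, height, width), dtype=np.float32)
    idx = ((events[:, 0] - t0) // window_us).astype(int)
    for (t, x, y, p), i in zip(events, idx):
        frames[i, int(p), int(y), int(x)] += 1.0  # count events per polarity
    return frames

ev = np.array([[1_000, 10, 20, 1], [40_000, 11, 21, 0]])
frames = events_to_frames(ev, height=260, width=346)  # DVS-style resolution
```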
arXiv Detail & Related papers (2026-02-13T13:25:05Z)
- A Data-Centric Approach to Pedestrian Attribute Recognition: Synthetic Augmentation via Prompt-driven Diffusion Models [41.58360335940522]
Pedestrian Attribute Recognition (PAR) is a challenging task, as models are required to generalize across numerous attributes in real-world data. We propose a data-centric approach to improve PAR through synthetic data augmentation guided by textual descriptions.
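Prompt-driven synthesis of this kind can be sketched with the Hugging Face diffusers library. The model id and prompt template below are assumptions for illustration; the paper's actual diffusion model and prompts are not specified in this summary.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical prompt built from attribute labels of a training sample.
attributes = {"gender": "female", "upper": "red jacket", "accessory": "backpack"}
prompt = ("a full-body surveillance photo of a pedestrian, "
          + ", ".join(attributes.values()))

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("synthetic_pedestrian.png")  # add as an extra labeled sample
```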
arXiv Detail & Related papers (2025-09-02T08:56:39Z)
- CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations [52.251569042852815]
CRIA is an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods under the same pre-training conditions.
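A minimal sketch of cross-attention fusing temporal, spectral, and spatial token streams is shown below; the single learnable query and the dimensions are assumptions, and CRIA's actual fusion is more involved.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """A learnable query attends over concatenated temporal, spectral,
    and spatial tokens to produce one unified embedding."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, temporal, spectral, spatial):
        tokens = torch.cat([temporal, spectral, spatial], dim=1)  # (B, N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)
        return fused.squeeze(1)  # (B, dim) unified EEG representation

f = CrossViewFusion()
out = f(torch.randn(2, 10, 256), torch.randn(2, 8, 256), torch.randn(2, 19, 256))
```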
arXiv Detail & Related papers (2025-06-19T06:31:08Z)
- Self-Organizing Visual Prototypes for Non-Parametric Representation Learning [6.096888891865663]
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features. We evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection.
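The core idea of holding a prototype as a set of support embeddings rather than a single mean vector can be sketched as follows; the best-match (max over SEs) scoring rule is an illustrative assumption, not necessarily SOP's exact objective.

```python
import torch
import torch.nn.functional as F

def prototype_scores(x: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """x: (B, D) features; support: (K, S, D) -- K prototypes, each held as
    S support embeddings. Score each sample against every prototype by its
    best-matching support embedding (illustrative choice)."""
    x = F.normalize(x, dim=-1)
    support = F.normalize(support, dim=-1)
    sims = torch.einsum("bd,ksd->bks", x, support)  # cosine sim to every SE
    return sims.max(dim=-1).values                  # (B, K) prototype scores

scores = prototype_scores(torch.randn(4, 128), torch.randn(10, 16, 128))
```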
arXiv Detail & Related papers (2025-05-23T20:12:07Z)
- AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [49.81255045696323]
We present the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet). AuxDet integrates metadata semantics with visual features, guiding adaptive representation learning for each sample. Experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods.
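One common way to condition visual features on per-sample metadata is FiLM-style modulation, used below as a stand-in for AuxDet's mechanism; the metadata fields and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MetadataFiLM(nn.Module):
    """FiLM-style conditioning: metadata predicts a per-channel scale and
    shift applied to the visual feature map. A stand-in for AuxDet's
    actual metadata integration."""
    def __init__(self, meta_dim: int, channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(meta_dim, 2 * channels)

    def forward(self, feat: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); meta: (B, meta_dim), e.g. sensor band, platform.
        gamma, beta = self.to_scale_shift(meta).chunk(2, dim=-1)
        return feat * (1 + gamma[..., None, None]) + beta[..., None, None]

film = MetadataFiLM(meta_dim=8, channels=64)
y = film(torch.randn(2, 64, 32, 32), torch.randn(2, 8))
```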
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
- Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies various dense prediction tasks and diverse semantic class predictions.
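One generic way to accept arbitrary temporal lengths and spectral band counts is to embed every band/time slice with shared weights and pool, as sketched below; this is an illustration of the general idea only, not STSUN's actual architecture.

```python
import torch
import torch.nn as nn

class AnySizeEncoder(nn.Module):
    """Shared single-band embedding plus mean pooling over bands and time,
    so inputs of any (T, C, H, W) shape are accepted."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.band_embed = nn.Conv2d(1, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        x = x.reshape(b * t * c, 1, h, w)          # each band/time slice alone
        x = self.band_embed(x)                      # weights shared across all
        x = x.reshape(b, t * c, -1, h, w).mean(1)   # pool over time and bands
        return x  # (B, dim, H, W), independent of T and C

enc = AnySizeEncoder()
out = enc(torch.randn(2, 5, 13, 32, 32))  # 5 timestamps, 13 spectral bands
```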
arXiv Detail & Related papers (2025-05-18T07:39:17Z)
- RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework [20.19599141770658]
Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. We propose a novel multi-modal RGB-Event attribute recognition task, drawing inspiration from the advantages of event cameras in low-light and high-speed scenarios and their low power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR.
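An asymmetric fusion, where RGB is the primary stream and event features are injected into it, can be sketched as below. Note the paper's RWKV blocks are replaced here with plain cross-attention for brevity, so this is only a simplified stand-in.

```python
import torch
import torch.nn as nn

class AsymmetricFusion(nn.Module):
    """RGB tokens query event tokens via cross-attention with a residual
    connection, keeping RGB as the dominant stream (attention stands in
    for the paper's RWKV-based fusion)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, event_tokens):
        # rgb_tokens: (B, N, dim); event_tokens: (B, M, dim)
        ctx, _ = self.cross(rgb_tokens, event_tokens, event_tokens)
        return self.norm(rgb_tokens + ctx)  # residual: RGB stays primary

fuse = AsymmetricFusion()
out = fuse(torch.randn(2, 196, 256), torch.randn(2, 49, 256))
```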
arXiv Detail & Related papers (2025-04-14T09:22:16Z)
- Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
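Low-Rank Adaptation of the kind MIGLoRA plugs into UNet/DiT backbones wraps a frozen linear layer with a trainable low-rank update. A minimal sketch follows; the rank and scaling values are illustrative, not MIGLoRA's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with only A and B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(320, 320))  # e.g. an attention projection
y = layer(torch.randn(4, 320))
```

Initializing B to zero makes the adapter a no-op at the start of training, so the pretrained backbone's behavior is preserved until the low-rank path learns something useful.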
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
- Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect. We propose a novel robust multi-view learning method (namely RML) with simultaneous representation fusion and alignment. Our RML is self-supervised and can also be applied to downstream tasks as a regularizer.
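Sample-level attention fusion can be sketched as weighting each view per sample before summing; this is a simplified reading of RML's fusion, and the simulated-perturbation alignment loss is omitted here.

```python
import torch
import torch.nn as nn

class SampleLevelFusion(nn.Module):
    """Scores each view embedding per sample and fuses with a softmax-
    weighted sum, so unreliable views can be down-weighted individually."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, dim) -- one embedding per view per sample
        w = torch.softmax(self.score(views), dim=1)  # (B, V, 1) weights
        return (w * views).sum(dim=1)                # (B, dim) fused

fusion = SampleLevelFusion(dim=128)
z = fusion(torch.randn(4, 3, 128))  # 3 views fused per sample
```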
arXiv Detail & Related papers (2025-03-06T07:01:08Z)
- PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model [83.35198885088093]
PolSAR data presents unique challenges due to its rich and complex characteristics. Existing data representations, such as complex-valued data, polarimetric features, and amplitude images, are widely used. Most feature extraction networks for PolSAR are small, limiting their ability to capture features effectively. We propose the Polarimetric Scattering Mechanism-Informed SAM (PolSAM), an enhanced Segment Anything Model (SAM) that integrates domain-specific scattering characteristics and a novel prompt generation strategy.
arXiv Detail & Related papers (2024-12-17T09:59:53Z)
- FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition [8.444515700910879]
Federated learning enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. In this paper, we address the challenges of non-IID data environments featuring multiple groups of images of different types. We introduce FissionVAE, which decouples the latent space and constructs decoder branches tailored to individual client groups.
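The decoder-decomposition idea can be sketched as a VAE with a shared encoder and one decoder branch per client group; the layer sizes and group names below are hypothetical, and FissionVAE's actual design is richer.

```python
import torch
import torch.nn as nn

class FissionStyleVAE(nn.Module):
    """Shared encoder, group-specific decoder branches: each client group
    decodes from the common latent space through its own branch."""
    def __init__(self, x_dim=784, z_dim=16, groups=("faces", "digits")):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs mu and log-variance
        self.decoders = nn.ModuleDict({g: nn.Linear(z_dim, x_dim) for g in groups})

    def forward(self, x: torch.Tensor, group: str):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoders[group](z), mu, logvar  # group-specific decoding

vae = FissionStyleVAE()
recon, mu, logvar = vae(torch.randn(8, 784), group="digits")
```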
arXiv Detail & Related papers (2024-08-30T08:22:30Z)
- Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework [15.991114464911844]
In the past five years, no large-scale pedestrian attribute recognition dataset has been released to the public.
This paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, MSP60K.
It consists of 60,122 images and 57 attribute annotations across eight scenarios.
arXiv Detail & Related papers (2024-08-19T06:19:31Z)
- Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection [6.367999777464464]
Multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting.
In this paper, we introduce the Straight-through Gumbel-Softmax framework, offering a comprehensive approach to searching multimodal fusion model architectures.
Experiments on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 94.4% achieved with minimal model parameters.
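The straight-through Gumbel-Softmax trick makes a discrete architecture choice in the forward pass while keeping gradients differentiable, and is available in PyTorch as F.gumbel_softmax(..., hard=True). The candidate fusion operations below are illustrative, not the paper's search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSearch(nn.Module):
    """Picks one of several candidate fusion ops with a straight-through
    Gumbel-Softmax: one-hot (discrete) forward, soft gradients backward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(3))  # one logit per candidate
        self.proj = nn.Linear(2 * dim, dim)        # for the concat candidate

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        candidates = torch.stack([
            audio + visual,                             # additive fusion
            audio * visual,                             # multiplicative fusion
            self.proj(torch.cat([audio, visual], -1)),  # concat + projection
        ], dim=0)                                       # (3, B, dim)
        w = F.gumbel_softmax(self.alpha, tau=1.0, hard=True)  # one-hot sample
        return (w[:, None, None] * candidates).sum(0)         # (B, dim)

search = FusionSearch()
fused = search(torch.randn(4, 128), torch.randn(4, 128))
```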
arXiv Detail & Related papers (2024-06-19T09:26:22Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames in order to fully exploit temporal information.
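Side tuning typically keeps the pre-trained backbone frozen and trains only a lightweight parallel branch. The sketch below applies that idea temporally, with a frozen per-frame backbone and a small trainable GRU mixing frames over time; the backbone choice, GRU, and pooling are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class TemporalSideTuner(nn.Module):
    """Frozen per-frame backbone plus a small trainable side branch that
    mixes frame features over time before attribute classification."""
    def __init__(self, backbone: nn.Module, dim: int = 768, n_attr: int = 57):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # foundation model stays untouched
        self.side = nn.GRU(dim, dim, batch_first=True)  # cheap temporal mixer
        self.head = nn.Linear(dim, n_attr)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t = frames.shape[:2]
        with torch.no_grad():
            f = self.backbone(frames.flatten(0, 1)).reshape(b, t, -1)
        h, _ = self.side(f)              # trainable temporal side branch
        return self.head(h.mean(dim=1))  # video-level attribute logits

# Toy stand-in backbone mapping each frame to a 768-d vector.
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
model = TemporalSideTuner(toy)
logits = model(torch.randn(2, 8, 3, 224, 224))  # 8 frames per clip
```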
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.