Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework
- URL: http://arxiv.org/abs/2408.09720v1
- Date: Mon, 19 Aug 2024 06:19:31 GMT
- Title: Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework
- Authors: Jiandong Jin, Xiao Wang, Qian Zhu, Haiyang Wang, Chenglong Li,
- Abstract summary: In the past five years, no large-scale pedestrian attribute recognition dataset has been opened to the public.
This paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, MSP60K.
It consists of 60,122 images and 57 attribute annotations across eight scenarios.
- Score: 15.991114464911844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and the performance on these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also conducted to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with an LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets have thoroughly validated the efficacy of our proposed framework. The dataset and source code accompanying this paper will be made publicly available at \url{https://github.com/Event-AHU/OpenPAR}.
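The abstract describes the LLM-PAR pipeline only at a high level. Below is a minimal sketch in PyTorch of how such a pipeline could be wired together: a stand-in ViT-style backbone produces patch tokens, learnable part queries are pooled through a cross-attention decoder (the "multi-embedding query Transformer"), and a simple projection stands in for the LLM-based feature augmentation. All module names, sizes, and the LLM stub are illustrative assumptions, not the authors' released implementation (see \url{https://github.com/Event-AHU/OpenPAR} for the official code).

```python
# Minimal sketch (not the authors' code) of the LLM-PAR pipeline described in the
# abstract: a ViT backbone extracts patch features, a multi-embedding query
# Transformer pools them into part-aware embeddings, and the pooled features feed
# an attribute classifier. The LLM branch is stubbed as a feature projection;
# module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class LLMPARSketch(nn.Module):
    def __init__(self, num_attributes=57, num_part_queries=8, dim=768):
        super().__init__()
        # Stand-in for the ViT backbone: patch embedding + Transformer encoder.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4
        )
        # Learnable part queries; cross-attention pools patch tokens into
        # part-aware features (the "multi-embedding query Transformer").
        self.part_queries = nn.Parameter(torch.randn(num_part_queries, dim))
        self.query_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        # Hypothetical stand-in for the LLM-derived feature augmentation.
        self.llm_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(num_part_queries * dim, num_attributes)

    def forward(self, images):                        # images: (B, 3, 224, 224)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens)
        queries = self.part_queries.expand(images.size(0), -1, -1)    # (B, Q, dim)
        parts = self.query_decoder(queries, tokens)                   # (B, Q, dim)
        parts = parts + self.llm_proj(parts)          # placeholder "augmentation"
        return self.classifier(parts.flatten(1))      # (B, num_attributes) logits


logits = LLMPARSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 57])
```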
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [32.57246173437492]
This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs.
By analyzing object differences between similar images, we challenge models to identify both matching and distinct components.
We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames so that temporal information can be fully exploited.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data [8.905439446173503]
Vision-language models (VLMs) are generally trained on datasets consisting of image-caption pairs obtained from the web.
Real-world multimodal datasets, such as healthcare data, are significantly more complex.
ViLLA is trained to capture fine-grained region-attribute relationships from complex datasets.
arXiv Detail & Related papers (2023-08-22T05:03:09Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
Detection Hub is a new dataset-aware and category-aligned design.
It mitigates dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
Categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Multi-Domain Multi-Definition Landmark Localization for Small Datasets [1.2691047660244332]
We present a novel method for multi-image-domain, multi-landmark-definition learning for facial landmark localization on small datasets.
We propose a Vision Transformer encoder with a novel decoder that uses a definition-shared, landmark-semantic-group structured prior.
We show state-of-the-art performance on several small datasets from varied image domains, covering animals, caricatures, and facial portrait paintings.
arXiv Detail & Related papers (2022-03-19T17:09:29Z)
- Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
To better model the relationships among images and classes from different datasets, we extend the pixel-level embeddings via cross-dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- A Universal Representation Transformer Layer for Few-Shot Image Classification [43.31379752656756]
Few-shot classification aims to recognize unseen classes when presented with only a small number of samples.
We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources.
Here, we propose a Universal Representation Transformer layer that meta-learns to leverage universal features for few-shot classification.
arXiv Detail & Related papers (2020-06-21T03:08:00Z)
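As flagged in the Detection Hub entry above, the following is a minimal sketch of the category-alignment idea (an illustrative assumption, not that paper's implementation): classification scores are computed against word embeddings of category names instead of dataset-specific one-hot classifiers, so labels from different detection datasets share one semantic space. The vocabulary, feature sizes, and the nn.Embedding stand-in for pretrained word embeddings are all hypothetical.

```python
# Minimal sketch (assumption-laden, not the Detection Hub implementation) of
# category alignment via word embeddings: class logits are cosine similarities
# between projected region features and category-name embeddings, giving a
# single semantic label space shared across datasets.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
# Unified category vocabulary across two hypothetical detection datasets.
categories = ["person", "car", "bicycle", "dog", "traffic light"]
# Stand-in for pretrained word/text embeddings of the category names.
cat_embed = nn.Embedding(len(categories), dim)
region_head = nn.Linear(1024, dim)  # projects detector region features

region_feats = torch.randn(8, 1024)             # 8 region proposals
proj = F.normalize(region_head(region_feats), dim=-1)
classes = F.normalize(cat_embed.weight, dim=-1)
logits = proj @ classes.t() / 0.07              # cosine similarity with temperature
print(logits.shape)                             # torch.Size([8, 5])
```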
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.