SapiensID: Foundation for Human Recognition
- URL: http://arxiv.org/abs/2504.04708v1
- Date: Mon, 07 Apr 2025 03:38:07 GMT
- Title: SapiensID: Foundation for Human Recognition
- Authors: Minchul Kim, Dingqiang Ye, Yiyang Su, Feng Liu, Xiaoming Liu
- Abstract summary: SapiensID is a unified model for face and body analysis, achieving robust performance across diverse settings. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios.
- Score: 15.65725865703615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a masked recognition model (MRM) that learns from variable token lengths, and (iii) Semantic Attention Head (SAH), a module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions.
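The Retina Patch idea described above, finer patches over a region of interest and coarser patches elsewhere so that token count tracks subject scale rather than image size, can be illustrated with a minimal sketch. Everything here (the function name, the two fixed patch sizes, the single rectangular ROI) is a hypothetical simplification for illustration, not the paper's actual implementation.

```python
import numpy as np

def dynamic_patches(image, roi, fine=8, coarse=32):
    """Toy 'retina patch' tokenizer: fine patches inside the ROI
    (e.g. a detected face box), coarse patches over the rest of the
    image. Returns a list of flattened patch vectors, so the token
    sequence length varies with the ROI size."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = roi
    tokens = []
    # Coarse grid over the whole image, skipping cells whose top-left
    # corner falls inside the ROI.
    for y in range(0, h, coarse):
        for x in range(0, w, coarse):
            if x0 <= x < x1 and y0 <= y < y1:
                continue
            tokens.append(image[y:y + coarse, x:x + coarse].reshape(-1))
    # Fine grid restricted to the ROI.
    for y in range(y0, y1, fine):
        for x in range(x0, x1, fine):
            tokens.append(image[y:y + fine, x:x + fine].reshape(-1))
    return tokens

img = np.zeros((128, 128, 3))
toks = dynamic_patches(img, roi=(32, 32, 96, 96))
```

A downstream recognition model trained on such tokenizations must handle variable-length sequences, which is presumably why the abstract pairs RP with a masked recognition model that "learns from variable token length".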
Related papers
- FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation [42.980289787679084]
Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance.
arXiv Detail & Related papers (2025-03-27T15:14:03Z) - A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification [68.06251161453417]
Person Re-identification (ReID) systems identify individuals across images or video frames. Many ReID methods are influenced by sensitive attributes such as gender, pose, and body mass index (BMI). We extend the concept of expressivity to the body recognition domain to better understand how ReID models encode these attributes.
arXiv Detail & Related papers (2025-03-09T05:15:54Z) - Unconstrained Body Recognition at Altitude and Range: Comparing Four Approaches [0.0]
We focus on learning persistent body shape characteristics that remain stable over time. We introduce body identification models based on a Vision Transformer (ViT) and a Swin-ViT. All models are trained on a large and diverse dataset of over 1.9 million images of approximately 5k identities across 9 databases.
arXiv Detail & Related papers (2025-02-10T23:49:06Z) - Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification [28.794827024749658]
Pose-dIVE is a novel data augmentation approach that incorporates sparse and underrepresented human pose and camera viewpoint examples into the training data.
Our objective is to augment the training dataset to enable existing Re-ID models to learn features unbiased by human pose and camera viewpoint variations.
arXiv Detail & Related papers (2024-06-23T07:48:21Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification [51.89551385538251]
We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos.
VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features.
Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space.
arXiv Detail & Related papers (2023-11-27T19:30:30Z) - Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification [90.39454748065558]
Body shape is one of the significant modality-shared cues for VI-ReID.
We propose shape-erased feature learning paradigm that decorrelates modality-shared features in two subspaces.
Experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
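The shape-erased idea summarized above, decorrelating modality-shared features across two subspaces, can be sketched as an orthogonal projection that removes a feature's component in a given subspace. The shape subspace below is a random placeholder, not the learned subspace from the paper; this is only a sketch of the underlying linear-algebra operation.

```python
import numpy as np

def erase_subspace(features, basis):
    """Project each row of `features` onto the orthogonal complement of
    the subspace spanned by the columns of `basis`, removing ("erasing")
    that subspace's component from every feature vector."""
    q, _ = np.linalg.qr(basis)         # orthonormal basis of the subspace
    proj = q @ q.T                     # projector onto the subspace
    return features - features @ proj  # keep only the complement

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))        # 4 feature vectors, dim 16
shape_basis = rng.normal(size=(16, 2))  # hypothetical 2-D "shape" subspace
erased = erase_subspace(feats, shape_basis)
```

After erasure, each feature vector has (numerically) zero component along the given subspace, leaving the complementary subspace to carry the remaining identity information.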
arXiv Detail & Related papers (2023-04-09T10:22:10Z) - Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z) - Learning Diversified Feature Representations for Facial Expression Recognition in the Wild [97.14064057840089]
We propose a mechanism to diversify the features extracted by CNN layers of state-of-the-art facial expression recognition architectures.
Experimental results on three well-known facial expression recognition in-the-wild datasets, AffectNet, FER+, and RAF-DB, show the effectiveness of our method.
arXiv Detail & Related papers (2022-10-17T19:25:28Z) - Pose Invariant Person Re-Identification using Robust Pose-transformation GAN [11.338815177557645]
Person re-identification (re-ID) aims to retrieve a person's images from an image gallery, given a single instance of the person of interest.
Despite several advancements, learning discriminative identity-sensitive and viewpoint-invariant features for robust person re-identification remains a major challenge, owing to the large pose variation of humans.
This paper proposes a re-ID pipeline that utilizes the image generation capability of Generative Adversarial Networks combined with pose regression and feature fusion to achieve pose invariant feature learning.
arXiv Detail & Related papers (2021-04-11T15:47:03Z) - View-Invariant Gait Recognition with Attentive Recurrent Learning of Partial Representations [27.33579145744285]
We propose a network that first learns to extract gait convolutional energy maps (GCEM) from frame-level convolutional features.
It then adopts a bidirectional neural network to learn from split bins of the GCEM, thus exploiting the relations between learned partial recurrent representations.
Our proposed model has been extensively tested on two large-scale CASIA-B and OU-M gait datasets.
arXiv Detail & Related papers (2020-10-18T20:20:43Z) - Cross-Resolution Adversarial Dual Network for Person Re-Identification and Beyond [59.149653740463435]
Person re-identification (re-ID) aims at matching images of the same person across camera views.
Due to varying distances between cameras and persons of interest, resolution mismatch can be expected.
We propose a novel generative adversarial network to address cross-resolution person re-ID.
arXiv Detail & Related papers (2020-02-19T07:21:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.