It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment
- URL: http://arxiv.org/abs/2411.10742v1
- Date: Sat, 16 Nov 2024 08:54:27 GMT
- Title: It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment
- Authors: Jinkai Zheng, Xinchen Liu, Boyue Zhang, Chenggang Yan, Jiyong Zhang, Wu Liu, Yongdong Zhang
- Abstract summary: This paper proposes a novel cross-granularity alignment gait recognition method, named XGait.
To achieve this goal, XGait first uses two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces.
Comprehensive experiments on two large-scale gait datasets show that XGait achieves Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG.
- Score: 72.75844404617959
- License:
- Abstract: Existing studies for gait recognition primarily utilized sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate in complex environments. To exploit the advantages of silhouettes and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first contains two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of the two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions like occlusions and cloth changes.
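The abstract leans on two quantities: the information entropy of a gait representation and Rank-1 accuracy. The sketch below is a toy illustration of both and is not the authors' code or evaluation protocol. It assumes that "information entropy" means the Shannon entropy of the per-pixel label distribution, and that Rank-1 is scored by nearest-neighbour matching under Euclidean distance. The 4x4 frame, the part labels 1 to 3, the feature vectors, and the function names `label_entropy` and `rank1_accuracy` are all hypothetical.

```python
import math
from collections import Counter


def label_entropy(mask):
    """Shannon entropy (in bits) of the label distribution over all pixels."""
    labels = [label for row in mask for label in row]
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank1_accuracy(probe, gallery):
    """Fraction of probes whose nearest gallery feature has the same identity.

    probe and gallery are lists of (identity, feature_vector) pairs.
    """
    hits = 0
    for probe_id, probe_feat in probe:
        # Nearest gallery entry under Euclidean distance (an assumption here).
        best_id, _ = min(gallery, key=lambda item: math.dist(probe_feat, item[1]))
        hits += best_id == probe_id
    return hits / len(probe)


# Toy 4x4 frame: 0 = background; parsing labels 1..3 are hypothetical parts.
parsing = [
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 2, 2, 0],
    [0, 3, 3, 0],
]
# The silhouette collapses every part label to a single foreground label.
silhouette = [[1 if label else 0 for label in row] for row in parsing]

silhouette_bits = label_entropy(silhouette)  # two labels only
parsing_bits = label_entropy(parsing)        # four labels

# Hypothetical 2-D features; real systems compare learned embeddings.
gallery = [("A", [0.0, 0.0]), ("B", [1.0, 1.0])]
probe = [("A", [0.1, 0.0]), ("B", [0.9, 1.2]), ("A", [0.8, 0.9])]
accuracy = rank1_accuracy(probe, gallery)

print(f"silhouette entropy: {silhouette_bits:.2f} bits")
print(f"parsing entropy:    {parsing_bits:.2f} bits")
print(f"Rank-1 accuracy:    {accuracy:.3f}")
```

On this toy frame the parsing map carries more bits per pixel than the silhouette because it distinguishes body parts, which is the trade-off the abstract describes. The Rank-1 helper ignores dataset-specific protocol details such as how probe and gallery sets are split.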
Related papers
- GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition [26.721242606715354]
Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns.
We propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA).
First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors.
arXiv Detail & Related papers (2024-07-20T09:05:17Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z)
- GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation [38.39173742709181]
Gait recognition holds the promise to robustly identify subjects based on walking patterns instead of appearance information.
In this work, we propose a novel, point-based Contour-Pose representation, which compactly expresses both body shape and body parts information.
We further propose a local-to-global architecture, called GaitContour, to leverage this novel representation.
arXiv Detail & Related papers (2023-11-27T17:06:25Z)
- HiH: A Multi-modal Hierarchy in Hierarchy Network for Unconstrained Gait Recognition [3.431054404120758]
We present a multi-modal Hierarchy in Hierarchy network (HiH) that integrates silhouette and pose sequences for robust gait recognition.
HiH features a main branch that utilizes Hierarchical Gait Decomposer modules for depth-wise and intra-module hierarchical examination of general gait patterns from silhouette data.
An auxiliary branch, based on 2D joint sequences, enriches the spatial and temporal aspects of gait analysis.
arXiv Detail & Related papers (2023-11-19T03:25:14Z)
- Parsing is All You Need for Accurate Gait Recognition in the Wild [51.206166843375364]
This paper presents a novel gait representation, named Gait Parsing Sequence (GPS).
GPSs are sequences of fine-grained human segmentation, extracted from video frames, so they have much higher information entropy.
We also propose a novel human parsing-based gait recognition framework, named ParsingGait.
The experimental results show a significant improvement in accuracy brought by the GPS representation and the superiority of ParsingGait.
arXiv Detail & Related papers (2023-08-31T13:57:38Z)
- ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
- GaitStrip: Gait Recognition via Effective Strip-based Feature Representations and Multi-Level Framework [34.397404430838286]
We present a strip-based multi-level gait recognition network, named GaitStrip, to extract comprehensive gait information at different levels.
To be specific, our high-level branch explores the context of gait sequences and our low-level one focuses on detailed posture changes.
Our GaitStrip achieves state-of-the-art performance in both normal walking and complex conditions.
arXiv Detail & Related papers (2022-03-08T09:49:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.