GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild
- URL: http://arxiv.org/abs/2408.06834v1
- Date: Tue, 13 Aug 2024 11:48:28 GMT
- Title: GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild
- Authors: Guozhen Peng, Yunhong Wang, Yuwei Zhao, Shaoxiong Zhang, Annan Li
- Abstract summary: We design a Global-Local Temporal Receptive Field Network (GLGait) to address the issue of gait recognition in the wild.
GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field.
It can also aggregate the pseudo global temporal receptive field into a true holistic temporal receptive field.
- Score: 36.04442506043822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gait recognition has attracted increasing attention from academia and industry as a non-intrusive, at-a-distance human recognition technology that requires no subject cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolutional Neural Network (ConvNet) based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. Directly replacing convolution blocks with vision transformer blocks, on the other hand, may not enhance the local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field; the module mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computational complexity than multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. Besides, it can also aggregate the pseudo global temporal receptive field into a true holistic temporal receptive field. Furthermore, we propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, i.e., Gait3D and GREW. The code is available at https://github.com/bgdpgz/GLGait.
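The abstract describes the GLTM as a pairing of pseudo-global attention (attention over a temporally reduced key/value set, cutting cost versus full MHSA) with a temporal convolution for the local receptive field. The paper's actual layer definitions are not given here, so the following is a minimal NumPy sketch under assumed design choices: average-pooled keys/values stand in for PGTA, a uniform per-channel 1D kernel stands in for the temporal convolution, and the names `pgta`, `temporal_conv`, and `gltm` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pgta(x, pool=4):
    """Pseudo global temporal attention (illustrative sketch).
    x: (T, C) frame features. Keys/values are average-pooled along time
    by `pool`, shrinking attention cost from O(T^2) to O(T * T/pool)."""
    T, C = x.shape
    Tp = T // pool
    kv = x[:Tp * pool].reshape(Tp, pool, C).mean(axis=1)   # (T/pool, C)
    attn = softmax(x @ kv.T / np.sqrt(C), axis=-1)          # (T, T/pool)
    return attn @ kv                                        # (T, C)

def temporal_conv(x, k=3):
    """Local temporal receptive field: per-channel 1D convolution
    with 'same' padding (uniform kernel, for illustration only)."""
    kernel = np.ones(k) / k
    return np.stack([np.convolve(x[:, c], kernel, mode='same')
                     for c in range(x.shape[1])], axis=1)

def gltm(x):
    """Combine the pseudo-global and local branches with a residual sum,
    so the convolution can aggregate the pooled-attention output toward
    a holistic temporal receptive field."""
    return x + pgta(x) + temporal_conv(x)

seq = np.random.rand(32, 8)   # 32 frames, 8-dim features per frame
out = gltm(seq)
print(out.shape)  # (32, 8)
```

The point of the sketch is the complexity argument: with 32 frames and `pool=4`, the attention matrix is 32 x 8 rather than 32 x 32, while the convolution branch still sees every adjacent-frame transition.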
Related papers
- Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation [17.023625615663665]
Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks.
CNNs struggle to model long-range dependencies explicitly due to the localized nature of convolution operations.
We introduce a simple yet effective module, Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird's-Eye View segmentation.
arXiv Detail & Related papers (2024-12-28T14:59:48Z) - LogoRA: Local-Global Representation Alignment for Robust Time Series Classification [31.704294005809082]
Unsupervised domain adaptation (UDA) of time series aims to teach models to identify consistent patterns across various temporal scenarios.
Existing UDA methods struggle to adequately extract and align both global and local features in time series data.
We propose the Local-Global Representation Alignment framework (LogoRA), which employs a two-branch encoder, comprising a multi-scale convolutional branch and a patching transformer branch.
Our evaluations on four time-series datasets demonstrate that LogoRA outperforms strong baselines by up to 12.52%, showcasing its superiority in time series UDA tasks.
arXiv Detail & Related papers (2024-09-12T13:59:03Z) - Learning Granularity Representation for Temporal Knowledge Graph Completion [2.689675451882683]
Temporal Knowledge Graphs (TKGs) incorporate temporal information to reflect the dynamic structural knowledge and evolutionary patterns of real-world facts.
This paper proposes Learning Granularity Representation (termed LGRe) for TKG completion.
It comprises two main components: Granularity Learning (GRL) and Adaptive Granularity Balancing (AGB).
arXiv Detail & Related papers (2024-08-27T08:19:34Z) - LoFormer: Local Frequency Transformer for Image Deblurring [12.032239441930306]
We introduce a novel approach termed Local Frequency Transformer (LoFormer)
Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows.
Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs.
arXiv Detail & Related papers (2024-07-24T04:27:03Z) - ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z) - Typhoon Intensity Prediction with Vision Transformer [51.84456610977905]
We introduce "Typhoon Intensity Transformer" (Tint) to predict typhoon intensity accurately across space and time.
Tint uses self-attention mechanisms with global receptive fields per layer.
Experiments on a publicly available typhoon benchmark validate the efficacy of Tint.
arXiv Detail & Related papers (2023-11-28T03:11:33Z) - LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition [19.5702895176141]
Previous methods for dynamic facial expression recognition (DFER) in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos.
Transformer-based methods for DFER can achieve better performance but result in higher FLOPs and computational costs.
Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K) indicate that our method provides an effective way to make use of the spatial and temporal dependencies for DFER.
arXiv Detail & Related papers (2023-05-05T07:53:13Z) - Local-Global Temporal Difference Learning for Satellite Video Super-Resolution [55.69322525367221]
We propose to exploit the well-defined temporal difference for efficient and effective temporal compensation.
To fully utilize the local and global temporal information within frames, we systematically model the short-term and long-term temporal discrepancies.
Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2023-04-10T07:04:40Z) - Temporal Context Aggregation Network for Temporal Action Proposal
Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z) - Gait Recognition via Effective Global-Local Feature Representation and
Local Temporal Aggregation [28.721376937882958]
Gait recognition is one of the most important biometric technologies and has been applied in many fields.
Recent gait recognition frameworks represent each gait frame by descriptors extracted from either global appearances or local regions of humans.
We propose a novel feature extraction and fusion framework to achieve discriminative feature representations for gait recognition.
arXiv Detail & Related papers (2020-11-03T04:07:13Z) - Dense Residual Network: Enhancing Global Dense Feature Flow for
Character Recognition [75.4027660840568]
This paper explores how to enhance the local and global dense feature flow by exploiting hierarchical features fully from all the convolution layers.
Technically, we propose an efficient and effective CNN framework, i.e., Fast Dense Residual Network (FDRN) for text recognition.
arXiv Detail & Related papers (2020-01-23T06:55:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.