Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition
- URL: http://arxiv.org/abs/2601.11931v2
- Date: Fri, 23 Jan 2026 11:54:11 GMT
- Title: Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition
- Authors: Zhengxian Wu, Chuanrui Zhang, Shenao Jiang, Hangrui Xu, Zirui Liao, Luyuan Zhang, Huaqiu Li, Peng Jiao, Haoqian Wang
- Abstract summary: We present a Language-guided and Motion-aware gait recognition framework, named LMGait. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences. We conduct extensive experiments across multiple datasets, and the results demonstrate the significant advantages of our proposed network.
- Score: 21.772052273755808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gait recognition is emerging as a promising technology and an innovative field within computer vision, with a wide range of applications in remote human identification. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion regions, such as the arms and legs. This bottleneck is particularly challenging in the presence of intra-class variation, where gait features of the same individual under different environmental conditions are significantly distant in the feature space. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LMGait. To the best of our knowledge, LMGait is the first method to introduce natural language descriptions as explicit semantic priors into the gait recognition task. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences. To improve cross-modal alignment, we propose the Motion Awareness Module (MAM), which refines the language features by adaptively adjusting various levels of semantic information to ensure better alignment with the visual representations. Furthermore, we introduce the Motion Temporal Capture Module (MTCM) to enhance the discriminative capability of gait features and improve the model's motion tracking ability. We conducted extensive experiments across multiple datasets, and the results demonstrate the significant advantages of our proposed network. Specifically, our model achieved accuracies of 88.5%, 97.1%, and 97.5% on the CCPG, SUSTech1K, and CASIA-B datasets, respectively, achieving state-of-the-art performance. Homepage: https://dingwu1021.github.io/LMGait/
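To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a language-guided, motion-aware gait model. The class names (MotionAwarenessModule, MotionTemporalCapture), dimensions, gating scheme, and cosine alignment loss are illustrative assumptions based only on the abstract; they are not the authors' implementation.

```python
# Hypothetical sketch inspired by the LMGait abstract; all module designs are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionAwarenessModule(nn.Module):
    """Adaptively re-weights multi-level language cues to better match visual gait features
    (a stand-in for the MAM idea of adjusting levels of semantic information)."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, num_levels), nn.Softmax(dim=-1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_levels: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # text_levels: (L, D) language cues at L semantic levels; visual_feat: (B, D)
        weights = self.gate(visual_feat)            # (B, L) per-sample level weights
        fused = weights @ text_levels               # (B, D) weighted language prior
        return self.proj(fused)


class MotionTemporalCapture(nn.Module):
    """Aggregates frame-level features with extra weight on dynamic (high-motion) frames
    (a stand-in for the MTCM idea of tracking motion over time)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D); motion proxy = frame-to-frame feature difference
        motion = torch.zeros_like(frames)
        motion[:, 1:] = frames[:, 1:] - frames[:, :-1]
        attn = torch.softmax(self.score(frames + motion), dim=1)   # (B, T, 1)
        return (attn * frames).sum(dim=1)                          # (B, D)


def alignment_loss(visual: torch.Tensor, language: torch.Tensor) -> torch.Tensor:
    # Simple cosine alignment between sequence-level visual features and language priors.
    return 1.0 - F.cosine_similarity(visual, language, dim=-1).mean()


if __name__ == "__main__":
    B, T, D, L = 4, 30, 256, 3
    frames = torch.randn(B, T, D)       # stand-in for per-frame gait embeddings
    text_levels = torch.randn(L, D)     # stand-in for encoded gait-related language cues
    seq_feat = MotionTemporalCapture(D)(frames)
    lang_prior = MotionAwarenessModule(D, L)(text_levels, seq_feat)
    print(alignment_loss(seq_feat, lang_prior))
```

The gating over language levels stands in for MAM's adaptive semantic adjustment, and the motion-difference attention stands in for MTCM's emphasis on dynamic frames; the real model would also include an identity loss over the aligned features.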
Related papers
- Arabic Sign Language Recognition using Multimodal Approach [0.0]
Arabic Sign Language (ArSL) is an essential communication method for individuals in the Deaf and Hard-of-Hearing community. Existing recognition systems face significant challenges due to their reliance on single-sensor approaches such as Leap Motion or RGB cameras. This paper investigates the potential of a multimodal approach that combines Leap Motion and RGB camera data to explore the feasibility of ArSL recognition.
arXiv Detail & Related papers (2026-01-20T09:21:43Z) - Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes [54.50887214639301]
We propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. In our zero-shot experiments, our model achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively.
arXiv Detail & Related papers (2025-10-31T07:45:44Z) - Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment [6.124050993047708]
WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for AIoT environments. We propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
arXiv Detail & Related papers (2025-10-15T10:28:50Z) - MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement [25.139037597606233]
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge. We propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations.
arXiv Detail & Related papers (2025-07-01T13:00:41Z) - SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations [68.9300049150948]
We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): existing data collection approaches yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. We present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood.
arXiv Detail & Related papers (2025-05-04T13:00:29Z) - Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction [4.692621855184482]
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain. Recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. We propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks.
arXiv Detail & Related papers (2025-04-27T02:55:54Z) - Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures. We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures. We propose a control paradigm for guidance, that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z) - EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z) - Multi-Granularity Language-Guided Training for Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - BigGait: Learning Gait Representation You Want by Large Vision Models [12.620774996969535]
Existing gait recognition methods rely on task-specific upstream driven by supervised learning to provide explicit gait representations.
Escaping from this trend, this work proposes a simple yet efficient gait framework, termed BigGait.
BigGait transforms all-purpose knowledge into implicit gait representations without requiring third-party supervision signals.
arXiv Detail & Related papers (2024-02-29T13:00:22Z) - Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance on specific-domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition [35.642868929840034]
Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns.
We propose a novel and high-performance framework named DyGait to focus on the extraction of dynamic features.
Our network achieves an average Rank-1 accuracy of 71.4% on the GREW dataset, 66.3% on the Gait3D dataset, 98.4% on the CASIA-B dataset and 98.3% on the OU-M dataset.
arXiv Detail & Related papers (2023-03-27T07:36:47Z) - Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z) - Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously or only model the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion (a minimal sketch in this spirit follows this entry).
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
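As referenced in the entry above, here is a minimal, hypothetical late-fusion sketch for pose-based recognition: a hand-rolled graph convolution over skeleton joints for the spatial branch and a GRU (used here in place of BERT) for the temporal branch. The joint graph, dimensions, and pooling choices are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical late-fusion sketch for pose-based sign language recognition (WSLR).
import torch
import torch.nn as nn


class SpatialGCNLayer(nn.Module):
    """One graph-convolution step over a fixed skeleton adjacency matrix."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        # Add self-loops and row-normalize the adjacency once, up front.
        a = adjacency + torch.eye(adjacency.size(0))
        self.register_buffer("adj", a / a.sum(dim=1, keepdim=True))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (B, T, J, C) -> aggregate neighboring joints, then transform channels
        return torch.relu(self.linear(torch.einsum("ij,btjc->btic", self.adj, joints)))


class LateFusionWSLR(nn.Module):
    def __init__(self, adjacency: torch.Tensor, num_classes: int, hidden: int = 64):
        super().__init__()
        self.gcn = SpatialGCNLayer(2, hidden, adjacency)          # 2-D joint coordinates
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)  # temporal branch
        self.spatial_head = nn.Linear(hidden, num_classes)
        self.temporal_head = nn.Linear(hidden, num_classes)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        spatial = self.gcn(joints)                    # (B, T, J, H)
        spatial_feat = spatial.mean(dim=(1, 2))       # pool over time and joints
        _, h = self.temporal(spatial.mean(dim=2))     # pool joints, encode time: h is (1, B, H)
        # Late fusion: average the logits of the spatial and temporal branches.
        return 0.5 * (self.spatial_head(spatial_feat) + self.temporal_head(h[-1]))


if __name__ == "__main__":
    J = 5
    adj = (torch.rand(J, J) > 0.5).float()           # toy skeleton graph
    model = LateFusionWSLR(adj, num_classes=10)
    logits = model(torch.randn(2, 16, J, 2))         # (batch, frames, joints, xy)
    print(logits.shape)                              # torch.Size([2, 10])
```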