VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation
- URL: http://arxiv.org/abs/2507.09672v1
- Date: Sun, 13 Jul 2025 15:11:18 GMT
- Title: VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation
- Authors: Xinyu Zhang, Zhonghao Ye, Jingwei Zhang, Xiang Tian, Zhisheng Liang, Shipeng Yu,
- Abstract summary: VST-Pose is a novel framework for accurate pose estimation using WiFi channel state information. Our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments.
- Score: 10.371863139473456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: WiFi-based human pose estimation has emerged as a promising non-visual alternative due to its penetrability and privacy advantages. This paper presents VST-Pose, a novel deep learning framework for accurate and continuous pose estimation using WiFi channel state information. The proposed method introduces ViSTA-Former, a spatiotemporal attention backbone that adopts a dual-stream architecture to separately capture temporal dependencies and structural relationships among body joints. To enhance sensitivity to subtle human motions, a velocity modeling branch is integrated into the framework, which learns short-term keypoint displacement patterns and improves fine-grained motion representation. We construct a 2D pose dataset specifically designed for smart home care scenarios and demonstrate that our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. Further evaluation on the public MMFi dataset confirms the model's robustness and effectiveness in 3D pose estimation tasks. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Our code is available at https://github.com/CarmenQing/VST-Pose.
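To make the two quantitative ingredients of the abstract concrete, the sketch below shows a generic PCK (Percentage of Correct Keypoints) evaluation and the frame-to-frame keypoint displacements that a velocity branch would typically regress as its supervision signal. This is an illustrative assumption, not the authors' implementation: the exact PCK@50 threshold convention (absolute pixels/millimetres versus a fraction of a reference bone length) and the velocity-branch design are defined in the paper and its repository, and the function names here are hypothetical.

```python
import numpy as np


def pck(pred, gt, threshold=50.0):
    """Percentage of Correct Keypoints.

    A keypoint counts as correct when its Euclidean distance to the ground
    truth is below `threshold` (here taken as an absolute value in the same
    units as the coordinates; the paper's PCK@50 convention may differ).

    pred, gt: arrays of shape (num_frames, num_joints, dim)
    """
    dist = np.linalg.norm(pred - gt, axis=-1)   # (num_frames, num_joints)
    return float((dist < threshold).mean())


def velocity_targets(keypoints):
    """Short-term keypoint displacements for a velocity branch.

    Simply the frame-to-frame difference of joint coordinates:
    (num_frames, num_joints, dim) -> (num_frames - 1, num_joints, dim)
    """
    return keypoints[1:] - keypoints[:-1]


if __name__ == "__main__":
    # Toy usage with random 2D poses (17 COCO-style joints).
    rng = np.random.default_rng(0)
    gt = rng.uniform(0, 500, size=(100, 17, 2))
    pred = gt + rng.normal(0, 20, size=gt.shape)  # noisy predictions
    print(f"PCK@50: {pck(pred, gt, 50.0):.3f}")
    print("velocity target shape:", velocity_targets(gt).shape)
```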
Related papers
- StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion [29.682018018059043]
StarPose is an autoregressive diffusion framework for 3D human pose estimation. It incorporates historical 3D pose predictions and spatial-temporal physical guidance. It achieves superior accuracy and temporal consistency in 3D human pose estimation.
arXiv Detail & Related papers (2025-08-04T04:50:05Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to meet the demands of 3D semantic occupancy prediction and flow estimation.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
The proposed architecture, named ALOcc, achieves an optimal trade-off between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation
in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
Most approaches trade off accuracy against jitter and do not fully capture the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation [67.12857074801731]
We introduce a novel method, CPPF++, designed for sim-to-real pose estimation.
To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty.
We incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a feature ensemble.
arXiv Detail & Related papers (2022-11-24T03:27:00Z) - (Fusionformer): Exploiting the Joint Motion Synergy with Fusion Network
Based On Transformer for 3D Human Pose Estimation [1.52292571922932]
Many previous methods lack an understanding of local joint information; prior work considers only the temporal relationship of a single joint.
Our proposed Fusionformer method introduces a global-temporal self-trajectory module and a cross-temporal self-trajectory module.
The results show an improvement of 2.4% MPJPE and 4.3% P-MPJPE on the Human3.6M dataset.
arXiv Detail & Related papers (2022-10-08T12:22:10Z) - An Online Semantic Mapping System for Extending and Enhancing Visual
SLAM [2.538209532048867]
We present a real-time semantic mapping approach for mobile vision systems with a 2D to 3D object detection pipeline and rapid data association for generated landmarks.
Our system reaches real-time capabilities with an average iteration duration of 65ms and is able to improve the pose estimation of a state-of-the-art SLAM by up to 68% on a public dataset.
arXiv Detail & Related papers (2022-03-08T09:14:37Z) - Improving Robustness and Accuracy via Relative Information Encoding in
3D Human Pose Estimation [59.94032196768748]
We propose a relative information encoding method that yields positional and temporal enhanced representations.
Our method outperforms state-of-the-art methods on two public datasets.
arXiv Detail & Related papers (2021-07-29T14:12:19Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Enhanced 3D Human Pose Estimation from Videos by using Attention-Based
Neural Network with Dilated Convolutions [12.900524511984798]
We show a systematic design for how conventional networks and other forms of constraints can be incorporated into the attention framework.
We achieve this by adapting temporal receptive field via a multi-scale structure of dilated convolutions.
Our method achieves the state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on Human3.6M dataset.
arXiv Detail & Related papers (2021-03-04T17:26:51Z) - InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic
Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)