SkeFi: Cross-Modal Knowledge Transfer for Wireless Skeleton-Based Action Recognition
- URL: http://arxiv.org/abs/2601.12432v1
- Date: Sun, 18 Jan 2026 14:39:02 GMT
- Title: SkeFi: Cross-Modal Knowledge Transfer for Wireless Skeleton-Based Action Recognition
- Authors: Shunyu Huang, Yunjiao Zhou, Jianfei Yang
- Abstract summary: Existing solutions use RGB cameras to annotate skeletal keypoints, but their performance declines in dark environments and they raise privacy concerns. This paper explores non-invasive wireless sensors, i.e., LiDAR and mmWave, to mitigate these challenges. Experiments demonstrate that SkeFi achieves state-of-the-art performance on mmWave and LiDAR.
- Score: 20.020503149009787
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Skeleton-based action recognition leverages human pose keypoints to categorize human actions, offering better generalization and interoperability than regular end-to-end action recognition. Existing solutions use RGB cameras to annotate skeletal keypoints, but their performance declines in dark environments and they raise privacy concerns, limiting their use in smart homes and hospitals. This paper explores non-invasive wireless sensors, i.e., LiDAR and mmWave, as a feasible alternative that mitigates these challenges. Two problems are addressed: (1) there is insufficient data in the wireless sensor modalities to train an accurate skeleton estimation model, and (2) skeletal keypoints derived from wireless sensors are noisier than those from RGB, making subsequent action recognition much harder. Our work, SkeFi, bridges these gaps through a novel cross-modal knowledge transfer method that draws on the data-rich RGB modality. We propose the enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) with frame interactive enhancement to overcome the noise from missing or non-consecutive frames. Additionally, our research underscores the effectiveness of multiscale temporal modeling through dual temporal convolution. By integrating TC-AGC with temporal modeling for cross-modal transfer, our framework can extract accurate poses and actions from noisy wireless sensors. Experiments demonstrate that SkeFi achieves state-of-the-art performance on mmWave and LiDAR. The code is available at https://github.com/Huang0035/Skefi.
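The abstract does not spell out TC-AGC or the dual temporal convolution beyond the description above; the linked repository is the authoritative implementation. As a rough, hedged PyTorch sketch of the general pattern it alludes to (a learned offset on a fixed skeleton adjacency, followed by parallel temporal convolutions with short and long kernels), one might write something like the following. All class and parameter names below are illustrative, not taken from the paper.

```python
# Illustrative sketch only -- not the authors' TC-AGC implementation.
# Skeleton tensors follow the common convention: x is (batch, channels, frames, joints).
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Graph convolution whose adjacency = fixed skeleton graph + learned offset."""
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        self.register_buffer("A_fixed", adjacency)                   # (joints, joints), from the skeleton
        self.A_learned = nn.Parameter(torch.zeros_like(adjacency))   # data-adaptive part
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        A = self.A_fixed + self.A_learned                  # adaptive adjacency
        x = torch.einsum("nctv,vw->nctw", x, A)            # mix information across joints
        return self.proj(x)

class DualTemporalConv(nn.Module):
    """Two parallel temporal convolutions (short/long kernels) for multiscale modeling."""
    def __init__(self, channels, kernels=(3, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
            for k in kernels
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches) / len(self.branches)

# Toy usage: 2 clips, 3 input channels (x, y, confidence), 64 frames, 17 joints.
A = torch.eye(17)                                          # placeholder skeleton adjacency
block = nn.Sequential(AdaptiveGraphConv(3, 64, A), DualTemporalConv(64))
out = block(torch.randn(2, 3, 64, 17))                     # -> (2, 64, 64, 17)
```

The adaptive term lets the model pick up joint correlations beyond the physical bone connections, which is plausibly helpful when wireless-derived keypoints are noisy or intermittently missing.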
Related papers
- milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion [24.89937570181235]
We present a radar-based 2D human pose estimation framework. We use a Cross-View Fusion Mamba to efficiently extract features from longer sequences. We also incorporate a velocity loss alongside the standard keypoint loss during training.
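The summary does not define the velocity loss; a common formulation, offered here only as a plausible reading, penalizes frame-to-frame keypoint displacement error in addition to the usual per-frame keypoint error:

```python
import torch
import torch.nn.functional as F

def keypoint_and_velocity_loss(pred, target, velocity_weight=0.5):
    """pred/target: (batch, frames, keypoints, 2) 2D joint coordinates.

    Illustrative only: combines a standard keypoint L1 loss with an L1 loss on
    frame-to-frame differences ("velocities"), as hinted at in the summary above.
    """
    keypoint_loss = F.l1_loss(pred, target)
    pred_vel = pred[:, 1:] - pred[:, :-1]          # per-frame displacement
    target_vel = target[:, 1:] - target[:, :-1]
    velocity_loss = F.l1_loss(pred_vel, target_vel)
    return keypoint_loss + velocity_weight * velocity_loss
```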
arXiv Detail & Related papers (2025-12-23T07:40:25Z)
- Learning Frequency and Memory-Aware Prompts for Multi-Modal Object Tracking [74.15663758681849]
We present Learning Frequency and Memory-Aware Prompts, a dual-adapter framework that injects lightweight prompts into a frozen RGB tracker. A frequency-guided visual adapter adaptively transfers complementary cues across modalities. A multilevel memory adapter with short, long, and permanent memory stores, updates, and retrieves reliable temporal context.
arXiv Detail & Related papers (2025-06-30T15:38:26Z)
- Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human activity recognition has primarily relied on traditional RGB cameras to achieve high performance. Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. In this work, we rethink human activity recognition by combining RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z)
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event representation that only requires lightweight decoder adaptation for target datasets.
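FreDF's exact frequency split is not described here; a generic way to decouple low-frequency structure from high-frequency edges, which may or may not match the paper's design, is to low-pass a feature map and keep the residual:

```python
import torch
import torch.nn.functional as F

def frequency_decouple(features, blur_kernel_size=5):
    """Split a feature map (B, C, H, W) into low- and high-frequency parts.

    Illustrative only: an average-pooling blur acts as a crude low-pass filter;
    the residual keeps edges and other high-frequency content.
    """
    pad = blur_kernel_size // 2
    low = F.avg_pool2d(features, blur_kernel_size, stride=1, padding=pad)
    high = features - low
    return low, high
```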
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
- STeInFormer: Spatial-Temporal Interaction Transformer Architecture for Remote Sensing Change Detection [5.4610555622532475]
We present STeInFormer, a spatial-temporal interaction Transformer architecture for multi-temporal feature extraction. We also propose a parameter-free multi-frequency token mixer to integrate frequency-domain features that provide spectral information for RSCD.
arXiv Detail & Related papers (2024-12-23T03:40:04Z)
- WiFi-TCN: Temporal Convolution for Human Interaction Recognition based on WiFi signal [4.0773490083614075]
Wi-Fi based human activity recognition has gained considerable interest in recent times.
A challenge associated with Wi-Fi-based HAR is the significant decline in performance when the scene or subject changes.
We propose a novel approach that leverages a temporal convolution network with augmentations and attention, referred to as TCN-AA.
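The summary gives no architectural detail for TCN-AA; as a hedged sketch of the generic building block it names (a dilated causal temporal convolution over WiFi CSI streams, without the paper's augmentation and attention components), one could write:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTCNBlock(nn.Module):
    """Generic dilated causal temporal-convolution block (illustrative; not TCN-AA itself)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation     # pad only on the left: no future leakage
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                                # x: (batch, channels, time)
        y = F.pad(x, (self.left_pad, 0))                 # causal padding
        return self.act(self.conv(y)) + x                # residual connection

csi = torch.randn(8, 90, 500)                            # e.g. 90 WiFi CSI streams, 500 time samples
out = CausalTCNBlock(90, dilation=2)(csi)                # -> (8, 90, 500)
```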
arXiv Detail & Related papers (2023-05-21T08:37:32Z)
- DynImp: Dynamic Imputation for Wearable Sensing Data Through Sensory and Temporal Relatedness [78.98998551326812]
We argue that traditional methods have rarely made use of both the time-series dynamics of the data and the relatedness of features from different sensors.
We propose a model, termed DynImp, that handles missingness at different time points using nearest neighbors along the feature axis.
We show that the method can exploit multi-modality features from related sensors and also learn from historical time-series dynamics to reconstruct data under extreme missingness.
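As a purely illustrative reading of "nearest neighbors along the feature axis" (DynImp itself is a learned model, not this heuristic), a missing reading can be filled from the most correlated sensor channels that are observed at the same time step:

```python
import numpy as np

def impute_feature_knn(X, k=3):
    """X: (time, features) array with NaNs marking missing readings.

    Illustrative heuristic only: a missing value at (t, f) is filled with the mean
    of the k observed features at time t that are most correlated with feature f
    over the rest of the recording.
    """
    X = X.copy()
    missing = np.isnan(X)
    # Feature-feature similarity, computed after a crude per-feature mean fill.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    corr = np.corrcoef(filled, rowvar=False)             # (features, features)
    for t, f in zip(*np.where(missing)):
        observed = np.where(~missing[t])[0]              # features present at this time step
        if observed.size == 0:
            continue                                     # nothing to borrow from at this step
        neighbors = observed[np.argsort(-corr[f, observed])[:k]]
        X[t, f] = X[t, neighbors].mean()
    return X
```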
arXiv Detail & Related papers (2022-09-26T21:59:14Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data.
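The bone stream referred to above is, in the usual two-stream skeleton-GCN convention, obtained by differencing each joint against its parent along the kinematic tree; a small sketch under that assumption (the parent indices below are placeholders, not a specific dataset's skeleton):

```python
import torch

def joints_to_bones(joints, parents):
    """joints: (batch, frames, num_joints, coords); parents: parent index per joint.

    Illustrative sketch of the standard bone representation used by two-stream
    skeleton GCNs: each bone vector is a joint minus its parent joint (the root
    lists itself as parent, giving a zero bone vector).
    """
    parent_joints = joints[:, :, parents, :]             # gather each joint's parent
    return joints - parent_joints

# Toy usage: 5-joint chain 0-1-2-3-4, root is joint 0.
x = torch.randn(2, 16, 5, 3)
bones = joints_to_bones(x, parents=[0, 0, 1, 2, 3])      # -> same shape as x
```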
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- MDPose: Human Skeletal Motion Reconstruction Using WiFi Micro-Doppler Signatures [4.92674421365689]
We propose MDPose, a novel framework for human skeletal motion reconstruction based on WiFi micro-Doppler signatures.
It provides an effective solution to track human activities by reconstructing a skeleton model with 17 key points.
MDPose outperforms state-of-the-art RF-based pose estimation systems.
arXiv Detail & Related papers (2022-01-11T21:46:28Z)
- Cross-modal Knowledge Distillation for Vision-to-Sensor Action Recognition [12.682984063354748]
This study introduces an end-to-end Vision-to-Sensor Knowledge Distillation (VSKD) framework.
In this VSKD framework, only time-series data, i.e., accelerometer data, is needed from wearable devices during the testing phase.
This framework will not only reduce the computational demands on edge devices, but also produce a learning model that closely matches the performance of the computationally expensive multi-modal approach.
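The distillation objective is not given in this summary; the standard knowledge-distillation recipe, which may differ from VSKD's actual loss, trains the accelerometer student to match the vision teacher's temperature-softened class probabilities alongside the hard labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Generic cross-modal KD loss (illustrative; not necessarily VSKD's exact formulation).

    Soft term: KL divergence between temperature-softened teacher and student
    distributions. Hard term: ordinary cross-entropy on ground-truth labels.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                          # rescale gradients after temperature softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```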
arXiv Detail & Related papers (2021-10-08T15:06:38Z)
- Feeling of Presence Maximization: mmWave-Enabled Virtual Reality Meets Deep Reinforcement Learning [76.46530937296066]
This paper investigates the problem of providing ultra-reliable and energy-efficient virtual reality (VR) experiences for wireless mobile users.
To ensure reliable ultra-high-definition (UHD) video frame delivery to mobile users, a coordinated multipoint (CoMP) transmission technique and millimeter wave (mmWave) communications are exploited.
arXiv Detail & Related papers (2021-06-03T08:35:10Z)