XRF V2: A Dataset for Action Summarization with Wi-Fi Signals, and IMUs in Phones, Watches, Earbuds, and Glasses
- URL: http://arxiv.org/abs/2501.19034v2
- Date: Wed, 16 Jul 2025 04:20:58 GMT
- Title: XRF V2: A Dataset for Action Summarization with Wi-Fi Signals, and IMUs in Phones, Watches, Earbuds, and Glasses
- Authors: Bo Lan, Pei Li, Jiaxi Yin, Yunpeng Song, Ge Wang, Han Ding, Jinsong Han, Fei Wang
- Abstract summary: This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences.
- Score: 16.719450267322653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization using Wi-Fi and IMU signals in smart-home environments, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and achieves the best performance with an average mAP of 78.74, outperforming the recent WiFiTAD by 5.49 points in mAP@avg while using 35% fewer parameters. For action summarization, we introduce a new metric, Response Meaning Consistency (RMC); XRFMamba achieves a mean RMC (mRMC) of 0.802. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation model pre-training, synthetic data generation, and more. The data and code are available at https://github.com/aiotgroup/XRFV2.
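For context, the mAP@avg figure above is the standard temporal-action-localization metric: predicted segments are matched to ground truth at several temporal-IoU (tIoU) thresholds, average precision (AP) is computed per class, and the per-threshold mAPs are averaged. The sketch below follows the common ActivityNet/THUMOS-style protocol in simplified form; the thresholds and matching details are our assumptions, and XRF V2's exact evaluation code lives in the linked repository.

```python
# Hedged sketch of mAP@avg for temporal action localization (TAL).
# Simplified: segments are pooled per class; real protocols group by video.
import numpy as np

def temporal_iou(a, b):
    """IoU between two [start, end] segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, tiou_thr):
    """preds: list of (score, [start, end]); gts: list of [start, end]."""
    preds = sorted(preds, key=lambda p: -p[0])  # highest score first
    matched = [False] * len(gts)
    tp, fp = np.zeros(len(preds)), np.zeros(len(preds))
    for i, (_, seg) in enumerate(preds):
        ious = [temporal_iou(seg, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= tiou_thr and not matched[j]:
            tp[i], matched[j] = 1, True
        else:
            fp[i] = 1
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 101-point interpolated area under the precision-recall curve
    ap = 0.0
    for t in np.linspace(0, 1, 101):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 101
    return ap

def map_at_avg(preds_by_class, gts_by_class, thrs=np.arange(0.3, 0.8, 0.1)):
    """Mean AP over classes, averaged over tIoU thresholds (mAP@avg)."""
    per_thr = [np.mean([average_precision(preds_by_class[c],
                                          gts_by_class[c], thr)
                        for c in gts_by_class]) for thr in thrs]
    return float(np.mean(per_thr))
```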
Related papers
- Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning [3.177649348456073]
Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. We propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data. Our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach.
arXiv Detail & Related papers (2025-07-17T18:47:46Z) - Hierarchical and Multimodal Data for Daily Activity Understanding [11.200514097148776]
Daily Activity Recordings for Artificial Intelligence (DARai) is a multimodal dataset constructed to understand human activities in real-world settings.
DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data.
Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications.
arXiv Detail & Related papers (2025-04-24T16:04:00Z) - Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts [14.801020598640191]
We propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three prompt strategies: label-only, brief description, and detailed action description. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization.
arXiv Detail & Related papers (2025-04-20T13:58:35Z) - AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results [55.33807002543901]
We present AIvaluateXR, a comprehensive evaluation framework for benchmarking large language models (LLMs) running on XR devices. We deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. We propose a unified evaluation method based on 3D Optimality theory to select the optimal device-model pairs from quality and speed objectives.
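As a rough illustration of selecting device-model pairs from joint quality and speed objectives, the sketch below keeps the non-dominated (Pareto-optimal) candidates. This is our simplification, not the paper's 3D Optimality formulation, and the device/model measurements are hypothetical.

```python
# Hedged sketch: Pareto-optimal device-model pairs over quality (higher is
# better) and latency (lower is better). Numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Candidate:
    device: str
    model: str
    quality: float     # e.g., response-quality score
    latency_ms: float  # end-to-end latency

def pareto_front(cands):
    """Keep candidates that no other candidate strictly dominates."""
    return [c for c in cands
            if not any(o.quality >= c.quality
                       and o.latency_ms <= c.latency_ms
                       and (o.quality > c.quality
                            or o.latency_ms < c.latency_ms)
                       for o in cands)]

# Hypothetical measurements for illustration only.
pairs = [
    Candidate("Meta Quest 3", "llama-3-8b", quality=0.71, latency_ms=950),
    Candidate("Apple Vision Pro", "phi-3-mini", quality=0.64, latency_ms=420),
    Candidate("Magic Leap 2", "gemma-2b", quality=0.55, latency_ms=610),
]
for c in pareto_front(pairs):
    print(c.device, c.model)  # prints the two non-dominated pairs
```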
arXiv Detail & Related papers (2025-02-13T20:55:48Z) - Scaling Wearable Foundation Models [54.93979158708164]
We investigate the scaling properties of sensor foundation models across compute, data, and model size.
Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM.
Our results establish the scaling laws of LSM for tasks such as imputation and extrapolation, both across time and across sensor modalities.
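Scaling laws of this kind are typically summarized by a power-law fit of error against data, model, or compute size. The sketch below fits such a curve on a log-log scale; the data points are illustrative placeholders, not LSM's actual numbers.

```python
# Hedged sketch: fitting a power law  err ~ a * N^(-b)  to (size, error)
# pairs in log space. The values below are placeholders, not LSM results.
import numpy as np

sizes = np.array([1e5, 1e6, 1e7, 1e8])       # e.g., hours of sensor data
errors = np.array([0.30, 0.21, 0.15, 0.11])  # e.g., imputation error

# Linear fit in log space: log(err) = log(a) - b * log(N)
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
a, b = np.exp(intercept), -slope
print(f"err ~ {a:.3f} * N^(-{b:.3f})")

# Extrapolate to a larger scale (treat with caution outside the fit range)
print("predicted err at N=1e9:", a * 1e9 ** (-b))
```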
arXiv Detail & Related papers (2024-10-17T15:08:21Z) - MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition [32.89577715124546]
We propose a novel unsupervised multimodal HAR solution, MaskFi, that leverages only unlabeled video and WiFi activity data for model training.
Benefiting from our unsupervised learning procedure, the network requires only a small amount of annotated data for finetuning and can adapt to the new environment with better performance.
arXiv Detail & Related papers (2024-02-29T15:27:55Z) - MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data points.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z) - Aria-NeRF: Multimodal Egocentric View Synthesis [17.0554791846124]
We seek to accelerate research in developing rich, multimodal scene models trained from egocentric data, based on differentiable volumetric ray-tracing inspired by Neural Radiance Fields (NeRFs).
This dataset offers a comprehensive collection of sensory data, featuring RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, and information from dual-frequency IMU datasets (1 kHz and 800 Hz).
The diverse data modalities and the real-world context captured within this dataset serve as a robust foundation for furthering our understanding of human behavior and enabling more immersive and intelligent experiences.
arXiv Detail & Related papers (2023-11-11T01:56:35Z) - MultiIoT: Benchmarking Machine Learning for the Internet of Things [70.74131118309967]
The next generation of machine learning systems must be adept at perceiving and interacting with the physical world.
Sensory data from motion, thermal, geolocation, depth, wireless signals, video, and audio are increasingly used to model the states of physical environments.
Existing efforts are often specialized to a single sensory modality or prediction task.
This paper proposes MultiIoT, the most expansive and unified IoT benchmark to date, encompassing over 1.15 million samples from 12 modalities and 8 real-world tasks.
arXiv Detail & Related papers (2023-11-10T18:13:08Z) - Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition [45.0131792009999]
We propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition.
Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information.
Our network outperforms state-of-the-art approaches in most standard evaluation settings.
arXiv Detail & Related papers (2023-07-22T03:51:32Z) - Contactless Human Activity Recognition using Deep Learning with Flexible and Scalable Software Define Radio [1.3106429146573144]
This study investigates the use of Wi-Fi channel state information (CSI) as a novel method of ambient sensing.
This approach avoids the additional costly hardware required by vision-based systems, which are also privacy-intrusive.
This study presents a Wi-Fi CSI-based HAR system that assesses and contrasts deep learning approaches.
arXiv Detail & Related papers (2023-04-18T10:20:14Z) - Variational Autoencoder Assisted Neural Network Likelihood RSRP Prediction Model [2.881201648416745]
We study a generative model for RSRP prediction that exploits minimization-of-drive-tests (MDT) data and a digital twin (DT).
Our proposed model, trained on real-world data, demonstrates an accuracy improvement of about 20% or more compared with the empirical model.
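The "likelihood" in this entry suggests the network predicts a distribution over RSRP rather than a point estimate. A minimal sketch of a Gaussian-likelihood regressor follows; this is our illustration of the general technique, not the paper's architecture, and the feature dimension is a placeholder.

```python
# Hedged sketch: a Gaussian-likelihood regressor for RSRP. The network
# outputs a mean and log-variance and is trained with negative
# log-likelihood. Illustration only, not the paper's architecture.
import torch
import torch.nn as nn

class GaussianRSRP(nn.Module):
    def __init__(self, in_dim=8, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, 1)       # predicted RSRP mean (dBm)
        self.log_var = nn.Linear(hidden, 1)  # predicted log-variance

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def nll_loss(mu, log_var, y):
    # Gaussian negative log-likelihood (up to an additive constant)
    return (0.5 * (log_var + (y - mu) ** 2 / log_var.exp())).mean()

# Hypothetical usage: x holds MDT-style features (location, cell info, ...)
model = GaussianRSRP(in_dim=8)
x, y = torch.randn(32, 8), torch.randn(32, 1)
mu, log_var = model(x)
nll_loss(mu, log_var, y).backward()
```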
arXiv Detail & Related papers (2022-06-27T17:27:35Z) - WiFi-based Spatiotemporal Human Action Perception [53.41825941088989]
An end-to-end WiFi signal neural network (SNN) is proposed to enable WiFi-only sensing in both line-of-sight and non-line-of-sight scenarios.
In particular, the 3D convolution module explores the temporal continuity of WiFi signals, and the feature self-attention module explicitly maintains dominant features.
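As a rough illustration of the 3D-convolution idea, a kernel can span time, antennas, and subcarriers of a CSI tensor jointly, so temporal continuity is captured alongside spatial/frequency structure. The tensor layout and sizes below are our assumptions, not the paper's configuration.

```python
# Hedged sketch: a 3D convolution over a WiFi CSI tensor laid out as
# (batch, channel, time, antennas, subcarriers). Layout and sizes are
# our assumptions, not the paper's configuration.
import torch
import torch.nn as nn

csi = torch.randn(4, 1, 128, 3, 30)  # batch, ch, time, antennas, subcarriers

block = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
    nn.BatchNorm3d(16),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 1, 1)),  # downsample the time axis only
)
feat = block(csi)
print(feat.shape)  # torch.Size([4, 16, 64, 3, 30])
```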
arXiv Detail & Related papers (2022-06-20T16:03:45Z) - A Wireless-Vision Dataset for Privacy Preserving Human Activity Recognition [53.41825941088989]
A new WiFi-based and video-based neural network (WiNN) is proposed to improve the robustness of activity recognition.
Our results show that the WiVi dataset satisfies the primary demand, and all three branches of the proposed pipeline maintain more than 80% activity recognition accuracy.
arXiv Detail & Related papers (2022-05-24T10:49:11Z) - SensiX++: Bringing MLOPs and Multi-tenant Model Serving to Sensory Edge Devices [69.1412199244903]
We present a multi-tenant runtime for adaptive model execution with integrated MLOps on edge devices, e.g., a camera, a microphone, or IoT sensors.
SensiX++ operates on two fundamental principles: highly modular componentisation to externalise data operations with clear abstractions, and document-centric manifestation for system-wide orchestration.
We report the overall throughput and the quantified benefits of various automation components of SensiX++, and demonstrate its efficacy in significantly reducing operational complexity and lowering the effort to deploy, upgrade, reconfigure, and serve embedded models on edge devices.
arXiv Detail & Related papers (2021-09-08T22:06:16Z) - Moving Object Classification with a Sub-6 GHz Massive MIMO Array using Real Data [64.48836187884325]
Classification between different activities in an indoor environment using wireless signals is an emerging technology for various applications.
In this paper, we analyze classification of moving objects by employing machine learning on real data from a massive multi-input-multi-output (MIMO) system in an indoor environment.
arXiv Detail & Related papers (2021-02-09T15:48:35Z) - SensiX: A Platform for Collaborative Machine Learning on the Edge [69.1412199244903]
We present SensiX, a personal edge platform that stays between sensor data and sensing models.
We demonstrate its efficacy in developing motion and audio-based multi-device sensing systems.
Our evaluation shows that SensiX offers a 7-13% increase in overall accuracy and up to 30% increase across different environment dynamics at the expense of 3mW power overhead.
arXiv Detail & Related papers (2020-12-04T23:06:56Z) - Sequential Weakly Labeled Multi-Activity Localization and Recognition on Wearable Sensors using Recurrent Attention Networks [13.64024154785943]
We propose a recurrent attention network (RAN) to handle sequential weakly labeled multi-activity recognition and location tasks.
Our RAN model can simultaneously infer multi-activity types from the coarse-grained sequential weak labels.
This greatly reduces the burden of manual labeling.
arXiv Detail & Related papers (2020-04-13T04:57:09Z)