SensorLM: Learning the Language of Wearable Sensors
- URL: http://arxiv.org/abs/2506.09108v1
- Date: Tue, 10 Jun 2025 17:13:09 GMT
- Title: SensorLM: Learning the Language of Wearable Sensors
- Authors: Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy, Maxwell A. Xu, Ahmed A. Metwally, Shawn Xu, Jake Garrison, Xuhai Xu, Tim Althoff, Yun Liu, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Cecilia Mascolo, Xin Liu, Daniel McDuff, Yuzhe Yang
- Abstract summary: We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people.
- Score: 50.95988682423808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.
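The abstract notes that SensorLM extends CLIP- and CoCa-style multimodal pretraining and recovers them as variants of one generic architecture. As a rough sketch only (not the authors' code; the encoder outputs, temperature, and loss weighting below are assumptions), a paired sensor/text objective could combine a contrastive term with an optional captioning term, where dropping the captioning term yields a CLIP-like variant and keeping it yields a CoCa-like one:

```python
# Minimal sketch, not SensorLM's implementation: a generic sensor-language
# pretraining objective with a CLIP-style contrastive term and an optional
# CoCa-style captioning term. Shapes, temperature, and alpha are assumptions.
import torch
import torch.nn.functional as F

def sensor_language_loss(sensor_emb, text_emb, caption_logits=None,
                         caption_targets=None, temperature=0.07, alpha=1.0):
    """sensor_emb, text_emb: (B, D) pooled embeddings of paired sensor windows
    and captions; caption_logits: optional (B, T, V) caption decoder outputs."""
    # Symmetric contrastive loss over in-batch pairs (matches lie on the diagonal).
    s = F.normalize(sensor_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature
    labels = torch.arange(s.size(0), device=s.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.T, labels))
    if caption_logits is None:          # contrastive-only: CLIP-like variant
        return contrastive
    # Autoregressive captioning loss over caption tokens: CoCa-like variant.
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return contrastive + alpha * captioning
```

Zero-shot recognition and cross-modal retrieval, as evaluated in the paper, would then reduce to scoring candidate captions (e.g., activity names) against a sensor embedding with the same similarity used in the contrastive term.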
Related papers
- Gensors: Authoring Personalized Visual Sensors with Multimodal Foundation Models and Reasoning [61.17099595835263]
Gensors is a system that empowers users to define customized sensors supported by the reasoning capabilities of MLLMs. In a user study, participants reported a significantly greater sense of control, understanding, and ease of communication when defining sensors using Gensors.
arXiv Detail & Related papers (2025-01-27T01:47:57Z)
- Scaling Wearable Foundation Models [54.93979158708164]
We investigate the scaling properties of sensor foundation models across compute, data, and model size.
Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM.
Our results establish the scaling laws of LSM for tasks such as imputation and extrapolation, both across time and sensor modalities.
arXiv Detail & Related papers (2024-10-17T15:08:21Z)
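For the scaling-law result summarized in the entry above, the standard recipe is to fit a saturating power law to held-out loss as a function of model size, data, or compute; the exponent and offset characterize how quickly performance improves. The sketch below uses placeholder numbers, not results from the LSM paper:

```python
# Illustrative only: fitting a saturating power law L(N) = a * N**(-b) + c to
# validation loss versus model size (or data/compute), the usual form behind
# "scaling law" claims. All numbers below are placeholders, not LSM results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])       # hypothetical parameter counts
val_losses  = np.array([0.92, 0.81, 0.73, 0.68, 0.65])  # hypothetical losses

(a, b, c), _ = curve_fit(power_law, model_sizes, val_losses,
                         p0=(1.0, 0.1, 0.5), maxfev=10000)
print(f"fit: L(N) ~ {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
```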
- SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing [6.8009140511761546]
Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. We construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks.
arXiv Detail & Related papers (2024-10-14T17:21:39Z)
- SensorLLM: Human-Intuitive Alignment of Multivariate Sensor Data with LLMs for Activity Recognition [9.072495000412943]
We introduce SensorLLM, a framework that enables Large Language Models (LLMs) to perform human activity recognition (HAR) from wearable sensor data. We construct SensorQA, a question-answering dataset of human-intuitive sensor-text pairs spanning diverse HAR scenarios. Our results show that, guided by human-intuitive alignment, SensorLLM becomes an effective sensor learner, reasoner, and classifier, generalizing across varied HAR settings.
arXiv Detail & Related papers (2024-10-14T15:30:41Z)
- Layout Agnostic Human Activity Recognition in Smart Homes through Textual Descriptions Of Sensor Triggers (TDOST) [0.22354214294493352]
We develop a layout-agnostic modeling approach for human activity recognition (HAR) systems in smart homes.
We generate Textual Descriptions Of Sensor Triggers (TDOST) that encapsulate the surrounding trigger conditions.
We demonstrate the effectiveness of TDOST-based models in unseen smart homes through experiments on benchmarked CASAS datasets.
arXiv Detail & Related papers (2024-05-20T20:37:44Z)
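The TDOST entry above replaces house-specific sensor identifiers with natural-language descriptions of each trigger, so that HAR models transfer to unseen layouts. A minimal sketch of that idea follows; the template and field names are illustrative assumptions, not the paper's exact phrasing:

```python
# Minimal sketch of the TDOST idea: turn each raw sensor trigger into a short
# textual description so downstream HAR models operate on layout-agnostic text
# instead of house-specific sensor IDs. Template and fields are assumptions.
from dataclasses import dataclass

@dataclass
class SensorTrigger:
    sensor_id: str      # e.g. "M014" in a CASAS home
    sensor_type: str    # "motion", "door", "temperature", ...
    location: str       # "kitchen", "bedroom", ...
    state: str          # "ON", "OFF", "OPEN", ...
    timestamp: str

def to_tdost(trigger: SensorTrigger) -> str:
    return (f"A {trigger.sensor_type} sensor in the {trigger.location} "
            f"reported {trigger.state} at {trigger.timestamp}.")

event = SensorTrigger("M014", "motion", "kitchen", "ON", "07:42")
print(to_tdost(event))
# -> "A motion sensor in the kitchen reported ON at 07:42."
# The descriptions can then be embedded with any off-the-shelf sentence encoder
# and fed to a sequence classifier, removing the dependence on a specific layout.
```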
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data samples.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z)
- Unsupervised Statistical Feature-Guided Diffusion Model for Sensor-based Human Activity Recognition [3.2319909486685354]
A key problem holding up progress in wearable sensor-based human activity recognition is the unavailability of diverse and labeled training data.
We propose an unsupervised statistical feature-guided diffusion model specifically optimized for wearable sensor-based human activity recognition.
By conditioning the diffusion model on statistical information such as mean, standard deviation, Z-score, and skewness, we generate diverse and representative synthetic sensor data.
arXiv Detail & Related papers (2023-05-30T15:12:59Z)
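The diffusion entry above conditions generation on simple per-window statistics. A small sketch of how such a conditioning vector could be assembled is below; the exact feature definitions (e.g., what the z-score is referenced against) are assumptions, and the diffusion model itself is omitted:

```python
# Sketch of the statistical conditioning signal: per-window mean, standard
# deviation, z-score, and skewness that a diffusion model can be conditioned on
# to synthesize representative sensor data. Shapes and definitions are assumptions.
import numpy as np
from scipy.stats import skew

def window_condition(window, dataset_mean, dataset_std):
    """window: (T, C) raw sensor segment; dataset_mean/std: (C,) global statistics."""
    mean = window.mean(axis=0)
    std = window.std(axis=0)
    zscore = (mean - dataset_mean) / (dataset_std + 1e-8)  # window vs. global stats
    sk = skew(window, axis=0)
    return np.concatenate([mean, std, zscore, sk])          # (4 * C,) condition vector

# Hypothetical tri-axial accelerometer window of 128 samples.
rng = np.random.default_rng(0)
cond = window_condition(rng.standard_normal((128, 3)), np.zeros(3), np.ones(3))
print(cond.shape)  # (12,)
```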
- On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks [61.74608497496841]
Training on inaccurate or corrupt data induces model bias and hampers generalisation capabilities.
This paper investigates the effect of sensor errors on the dense 3D vision tasks of depth estimation and reconstruction.
arXiv Detail & Related papers (2023-03-26T22:32:44Z)
- The LuViRA Dataset: Synchronized Vision, Radio, and Audio Sensors for Indoor Localization [41.58739817444644]
The dataset includes color images, corresponding depth maps, inertial measurement unit (IMU) readings, and the channel response between a 5G massive multiple-input multiple-output (MIMO) testbed and user equipment.
We synchronize these sensors to ensure that all data is recorded simultaneously.
The main aim of this dataset is to enable research on sensor fusion with the most commonly used sensors for localization tasks.
arXiv Detail & Related papers (2023-02-10T15:12:40Z)
- Using Language Model to Bootstrap Human Activity Recognition Ambient Sensors Based in Smart Homes [2.336163487623381]
We propose two Natural Language Processing embedding methods to enhance LSTM-based structures in activity-sequence classification tasks.
Results indicate that this approach provides useful information, such as a sensor organization map.
Our tests show that the embeddings can be pretrained on datasets other than the target one, enabling transfer learning.
arXiv Detail & Related papers (2021-11-23T21:21:14Z)
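The entry above treats ambient-sensor events the way NLP models treat words: each event is embedded and the sequence is classified by an LSTM. A minimal sketch under assumed vocabulary size, dimensions, and class count (not the paper's architecture):

```python
# Minimal sketch: integer-coded ambient-sensor events are embedded like word
# tokens and a sequence of them is classified into an activity by an LSTM.
# Vocabulary size, dimensions, and class count are illustrative assumptions.
import torch
import torch.nn as nn

class EmbeddingLSTMClassifier(nn.Module):
    def __init__(self, n_sensor_tokens=200, emb_dim=64, hidden=128, n_activities=10):
        super().__init__()
        # The embedding table can be pretrained on another home's event logs
        # and transferred, as the summary notes.
        self.embed = nn.Embedding(n_sensor_tokens, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_activities)

    def forward(self, event_ids):                 # (B, T) integer-coded sensor events
        x = self.embed(event_ids)                 # (B, T, emb_dim)
        _, (h, _) = self.lstm(x)                  # h: (1, B, hidden)
        return self.head(h[-1])                   # (B, n_activities) activity logits

logits = EmbeddingLSTMClassifier()(torch.randint(0, 200, (4, 50)))
print(logits.shape)  # torch.Size([4, 10])
```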
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
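The SAKDN entry above distills knowledge from wearable-sensor teachers into an RGB-video student. A generic distillation loss in that spirit is sketched below; the paper's semantics-aware, adaptive weighting is not reproduced, and the temperature and mixing weight are assumptions:

```python
# Generic knowledge-distillation loss, not SAKDN's adaptive scheme: a sensor
# teacher's softened predictions guide a video student via KL divergence,
# combined with standard cross-entropy on ground-truth action labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets from the wearable-sensor teacher, softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    # Supervised loss on the labeled actions for the video student.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```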