Related papers: MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding

MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding

URL: http://arxiv.org/abs/2512.21506v1
Date: Thu, 25 Dec 2025 04:37:07 GMT
Title: MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding
Authors: Aiwei Zhang, Arvind Pillai, Andrew Campbell, Nicholas C. Jacobson,
Abstract summary: MotionTeller is a generative framework that integrates minute-level wearable activity data with large language models (LLMs)<n>We construct a novel dataset of 54383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens.<n>MotionTeller achieves high semantic fidelity (BERT-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1.
Score: 4.158479111055355
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As wearable sensing becomes increasingly pervasive, a key challenge remains: how can we generate natural language summaries from raw physiological signals such as actigraphy - minute-level movement data collected via accelerometers? In this work, we introduce MotionTeller, a generative framework that natively integrates minute-level wearable activity data with large language models (LLMs). MotionTeller combines a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text, autoregressive generation of daily behavioral summaries. We construct a novel dataset of 54383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens. MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1. The average training loss converges to 0.38 by epoch 15, indicating stable optimization. Qualitative analysis confirms that MotionTeller captures circadian structure and behavioral transitions, while PCA plots reveal enhanced cluster alignment in embedding space post-training. Together, these results position MotionTeller as a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, introducing new pathways for behavioral monitoring, clinical review, and personalized health interventions.

Related papers

LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models [5.041844772782674]
We introduce LENS, a framework that aligns multimodal sensing data with language models to generate mental-health narratives.<n>LENS constructs a large-scale dataset by transforming responses related to depression and anxiety symptoms into natural-language descriptions.<n>Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy.
arXiv Detail & Related papers (2025-12-28T18:00:57Z)
ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning [18.435889278351297]
ESPADA is a semantically aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations.<n>To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping.<n> ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.
arXiv Detail & Related papers (2025-12-08T10:08:33Z)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation [10.122882293302787]
temporal segmentation of human actions is critical for intelligent robots in collaborative settings.<n>We propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data.<n>Our approach outperforms state-of-the-art methods, especially in action segmentation accuracy.
arXiv Detail & Related papers (2025-07-01T13:55:57Z)
SensorLM: Learning the Language of Wearable Sensors [50.95988682423808]
We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language.<n>We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data.<n>This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people.
arXiv Detail & Related papers (2025-06-10T17:13:09Z)
MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning [66.53533434848369]
We propose a motion-guided self-learning framework that learns densely consistent representations.<n>We improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks.
arXiv Detail & Related papers (2025-06-10T11:20:32Z)
MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition [16.764919107712146]
Motion-Primitive Transformer (MoPFormer) is a novel framework that enhances interpretability by tokenizing inertial measurement unit signals into meaningful motion primitives.<n>MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives.<n>Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets.
arXiv Detail & Related papers (2025-05-27T05:34:56Z)
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking [58.719310295870024]
This paper presents an event-based framework for tracking any point.<n>To resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process.<n>The method improves the $Survival_50$ metric by 17.9% over event-only tracking of any point baseline.
arXiv Detail & Related papers (2024-12-02T09:13:29Z)
Scaling Wearable Foundation Models [54.93979158708164]
We investigate the scaling properties of sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM. Our results establish the scaling laws of LSM for tasks such as imputation, extrapolation, both across time and sensor modalities.
arXiv Detail & Related papers (2024-10-17T15:08:21Z)
iKUN: Speak to Trackers without Retraining [21.555469501789577]
We propose an insertable Knowledge Unification Network, termed iKUN, to enable communication with off-the-shelf trackers. To improve the localization accuracy, we present a neural version of Kalman filter (NKF) to dynamically adjust process noise. We also contribute a more challenging dataset, Refer-Dance, by extending public DanceTrack dataset with motion and dressing descriptions.
arXiv Detail & Related papers (2023-12-25T11:48:55Z)
Semi-Supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix [59.55173022987071]
We study the potential of semi-supervised learning for class-agnostic motion prediction. Our framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data. Our method exhibits comparable performance to weakly and some fully supervised methods.
arXiv Detail & Related papers (2023-12-13T09:32:50Z)
Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.