Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning
- URL: http://arxiv.org/abs/2507.13482v1
- Date: Thu, 17 Jul 2025 18:47:46 GMT
- Title: Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning
- Authors: Seyyed Saeid Cheshmi, Buyao Lyu, Thomas Lisko, Rajesh Rajamani, Robert A. McGovern, Yogatheesan Varatharajah
- Abstract summary: Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. We propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data. Our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach.
- Score: 3.177649348456073
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson's disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.
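The abstract above does not spell out the architecture or training objective, so the following is only a minimal sketch of the generic CLIP-style recipe that cross-modal IMU-video pretraining typically follows: an IMU encoder and a video encoder are trained so that embeddings of time-aligned IMU/video pairs agree under a symmetric InfoNCE loss. All module shapes, encoders, and hyperparameters below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of CLIP-style cross-modal pretraining on paired IMU/video
# windows. Encoders and dimensions are toy placeholders, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    """Toy 1-D CNN over (batch, channels, time) accelerometer+gyroscope windows."""
    def __init__(self, in_channels=6, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.proj(self.net(x).squeeze(-1))

class VideoEncoder(nn.Module):
    """Stand-in for a video backbone: pools precomputed per-frame features."""
    def __init__(self, frame_feat_dim=512, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(frame_feat_dim, embed_dim)

    def forward(self, frame_feats):  # (batch, frames, frame_feat_dim)
        return self.proj(frame_feats.mean(dim=1))

def clip_loss(imu_emb, vid_emb, temperature=0.07):
    """Symmetric InfoNCE: time-aligned IMU/video windows are positives,
    every other pairing in the batch is a negative."""
    imu_emb = F.normalize(imu_emb, dim=-1)
    vid_emb = F.normalize(vid_emb, dim=-1)
    logits = imu_emb @ vid_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One illustrative step on random tensors standing in for a paired batch.
imu = torch.randn(32, 6, 200)   # 32 windows, 6 IMU channels, 200 time steps
vid = torch.randn(32, 16, 512)  # matching clips as 16 per-frame features
loss = clip_loss(IMUEncoder()(imu), VideoEncoder()(vid))
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```

After pretraining, the video branch can be discarded; the zero-shot and few-shot evaluations mentioned in the abstract then amount to using the frozen IMU encoder's embeddings directly (e.g., nearest-class-centroid) or fitting a small classifier on a handful of labeled OOD windows.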
Related papers
- Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm. EM-MIA achieves state-of-the-art results on WikiMIA.
arXiv Detail & Related papers (2024-10-10T03:31:16Z) - C3T: Cross-modal Transfer Through Time for Sensor-based Human Activity Recognition [7.139150172150715]
We introduce Cross-modal Transfer Through Time (C3T), which preserves temporal information during alignment to better handle dynamic sensor data. Our experiments on various camera+IMU datasets demonstrate that C3T outperforms existing methods in unsupervised modality adaptation (UMA) by at least 8% in accuracy.
arXiv Detail & Related papers (2024-07-23T19:06:44Z) - Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition [24.217068565936117]
We present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video.
To model the complex relations among the multiple IMU devices placed across the body, we exploit their collaborative dynamics.
Experiments show our method can achieve state-of-the-art performance on multiple public datasets.
arXiv Detail & Related papers (2024-07-09T07:53:16Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for the optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition [14.88934924520362]
Human Activity Recognition (HAR) systems have been extensively studied by the vision and ubiquitous computing communities.
We propose a hard negative sampling method for multimodal HAR, with a hard negative sampling loss for skeleton and IMU data pairs (a toy sketch of one such loss appears after this list).
We demonstrate the robustness of our approach for learning strong feature representations for HAR tasks, including in limited-data settings.
arXiv Detail & Related papers (2023-09-03T20:00:37Z) - Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
However, acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Learnable Weight Initialization for Volumetric Medical Image Segmentation [66.3030435676252]
We propose a learnable weight initialization approach for hybrid volumetric medical image segmentation models.
Our approach is easy to integrate into any hybrid model and requires no external training data.
Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-15T17:55:05Z) - BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors [8.695342954247606]
We present BenchMD, a benchmark that tests how well unified, modality-agnostic methods, including architectures and training techniques, perform on a diverse array of medical tasks.
Our baseline results demonstrate that no unified learning technique achieves strong performance across all modalities, leaving ample room for improvement on the benchmark.
arXiv Detail & Related papers (2023-04-17T17:59:26Z) - SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction [127.43571146741984]
Drug-Target Affinity (DTA) prediction is of vital importance in early-stage drug discovery.
Wet experiments remain the most reliable method, but they are time-consuming and resource-intensive.
Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue.
We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z) - In-Bed Human Pose Estimation from Unseen and Privacy-Preserving Image Domains [22.92165116962952]
In-bed human posture estimation provides important health-related metrics with potential value in medical condition assessments.
We propose a multi-modal conditional variational autoencoder (MC-VAE) capable of reconstructing features from missing modalities seen during training.
We demonstrate that body positions can be effectively recognized from the available modality, achieving results on par with baseline models.
arXiv Detail & Related papers (2021-11-30T04:56:16Z) - Self-supervised transfer learning of physiological representations from free-living wearable data [12.863826659440026]
We present a novel self-supervised representation learning method using activity and heart rate (HR) signals without semantic labels.
We evaluate our model on the largest free-living combined-sensing dataset (comprising >280k hours of wrist accelerometer & wearable ECG data).
arXiv Detail & Related papers (2020-11-18T23:21:34Z) - IMUTube: Automatic Extraction of Virtual on-body Accelerometry from Video for Human Activity Recognition [12.91206329972949]
We introduce IMUTube, an automated processing pipeline to convert videos of human activity into virtual streams of IMU data.
These virtual IMU streams represent accelerometry at a wide variety of locations on the human body.
We show how the virtually-generated IMU data improves the performance of a variety of models on known HAR datasets.
arXiv Detail & Related papers (2020-05-29T21:50:38Z)
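To make the IMUTube entry above concrete, here is a heavily simplified toy of its core idea: once video has been lifted to 3-D joint trajectories by a pose estimator (not shown), a virtual accelerometer signal at a joint can be approximated by twice differentiating its position and accounting for gravity. The real pipeline also handles camera motion, sensor orientation, and noise modeling; the function below is an illustrative assumption, not IMUTube's code.

```python
# Toy virtual-accelerometer extraction from 3-D joint trajectories (numpy).
import numpy as np

def virtual_accelerometer(positions, fps, gravity=(0.0, 0.0, -9.81)):
    """positions: (T, 3) world-frame joint positions in meters.
    Returns (T-2, 3) specific-force estimates in m/s^2 (what an ideal,
    world-aligned accelerometer would report), using a central
    second-order finite difference of position."""
    dt = 1.0 / fps
    accel = (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
    # An accelerometer measures specific force: body acceleration minus gravity.
    return accel - np.asarray(gravity)

# Example: a wrist oscillating sinusoidally at 2 Hz, "filmed" at 30 fps.
t = np.arange(0, 2, 1 / 30)
wrist = np.stack([0.1 * np.sin(2 * np.pi * 2 * t),  # x sways by 10 cm
                  np.zeros_like(t),                  # y fixed
                  np.full_like(t, 1.0)], axis=1)     # z fixed at 1 m
acc = virtual_accelerometer(wrist, fps=30)
print(acc.shape)  # (58, 3); a stationary joint would read (0, 0, 9.81)
```

A real IMU reports in the sensor's own frame, so a full implementation would also rotate these world-frame estimates using the limb's orientation.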
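Similarly, the "Multimodal Contrastive Learning with Hard Negative Sampling" entry above names a hard negative sampling loss for skeleton-IMU pairs without giving its form. The sketch below follows one common recipe, importance-weighting the most similar (hardest) in-batch negatives, and should be read as a plausible stand-in rather than that paper's actual loss.

```python
# Toy hard-negative-weighted contrastive loss for paired two-modality batches.
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(z_a, z_b, temperature=0.1, beta=1.0):
    """z_a, z_b: (batch, dim) embeddings of paired samples from two modalities.
    Negatives are reweighted by softmax(beta * similarity), so harder
    (more similar) negatives dominate the denominator; beta > 0."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    sim = (z_a @ z_b.t()) / temperature          # (batch, batch) similarities
    pos = sim.diagonal()                         # matched cross-modal pairs
    eye = torch.eye(sim.size(0), dtype=torch.bool)
    neg = sim.masked_fill(eye, float('-inf'))    # exclude positives
    weights = F.softmax(beta * neg, dim=1).detach()  # hardness weights
    neg_term = torch.logsumexp(neg + weights.log(), dim=1)
    # -log( e^pos / (e^pos + sum_i w_i * e^neg_i) )
    return (torch.logaddexp(pos, neg_term) - pos).mean()

# Example on random stand-ins for skeleton and IMU embeddings.
skeleton = torch.randn(32, 128)
imu = torch.randn(32, 128)
print(hard_negative_contrastive_loss(skeleton, imu).item())
```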
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.