IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing
Applications
- URL: http://arxiv.org/abs/2209.00945v2
- Date: Thu, 29 Feb 2024 12:20:59 GMT
- Title: IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing
Applications
- Authors: Hyungjun Yoon, Hyeongheon Cha, Hoang C. Nguyen, Taesik Gong, Sung-Ju
Lee
- Abstract summary: We propose IMG2IMU that adapts pre-trained representation from large-scale images to diverse IMU sensing tasks.
We convert the sensor data into visually interpretable spectrograms for the model to utilize the knowledge gained from vision.
IMG2IMU outperforms the baselines pre-trained on sensor data by an average of 9.6%p F1-score.
- Score: 6.865654843241631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training representations acquired via self-supervised learning could
achieve high accuracy on even tasks with small training data. Unlike in vision
and natural language processing domains, pre-training for IMU-based
applications is challenging, as there are few public datasets with sufficient
size and diversity to learn generalizable representations. To overcome this
problem, we propose IMG2IMU that adapts pre-trained representation from
large-scale images to diverse IMU sensing tasks. We convert the sensor data
into visually interpretable spectrograms for the model to utilize the knowledge
gained from vision. We further present a sensor-aware pre-training method for
images that enables models to acquire particularly impactful knowledge for IMU
sensing applications. This involves using contrastive learning on our
augmentation set customized for the properties of sensor data. Our evaluation
with four different IMU sensing tasks shows that IMG2IMU outperforms the
baselines pre-trained on sensor data by an average of 9.6%p F1-score,
illustrating that vision knowledge can be usefully incorporated into IMU
sensing applications where only limited training data is available.
Related papers
- PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision [7.896850422430362]
Inertial Measurement Units (IMUs) embedded in personal devices have enabled significant applications in health and wellness.
While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions.
For video or text modalities, the "pretrain and adapt" approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data.
This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and
arXiv Detail & Related papers (2024-11-22T18:46:30Z) - Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition [24.217068565936117]
We present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video.
To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics in multiple IMU devices.
Experiments show our method can achieve state-of-the-art performance on multiple public datasets.
arXiv Detail & Related papers (2024-07-09T07:53:16Z) - MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition [2.7532797256542403]
Human Activity Recognition (HAR) is a longstanding problem in AI with applications in a broad range of areas, including healthcare, sports and fitness, security, and more.
We introduce our comprehensive Fitness Multimodal Activity dataset (FiMAD) to enhance HAR performance across various modalities.
We show that classifiers pre-trained on FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH.
arXiv Detail & Related papers (2024-06-06T08:42:36Z) - Rethinking Transformers Pre-training for Multi-Spectral Satellite
Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data.
In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z) - Expanding Frozen Vision-Language Models without Retraining: Towards
Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular
Depth Estimation by Integrating IMU Motion Dynamics [74.1720528573331]
Unsupervised monocular depth and ego-motion estimation has drawn extensive research attention in recent years.
We propose DynaDepth, a novel scale-aware framework that integrates information from vision and IMU motion dynamics.
We validate the effectiveness of DynaDepth by conducting extensive experiments and simulations on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2022-07-11T07:50:22Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE)
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision
Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in vision-sensor modality (videos)
The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z) - DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning [83.48587570246231]
Visual Similarity plays an important role in many computer vision applications.
Deep metric learning (DML) is a powerful framework for learning such similarities.
We propose and study multiple complementary learning tasks, targeting conceptually different data relationships.
We learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance.
arXiv Detail & Related papers (2020-04-28T12:26:50Z) - A Deep Learning Method for Complex Human Activity Recognition Using
Virtual Wearable Sensors [22.923108537119685]
Sensor-based human activity recognition (HAR) is now a research hotspot in multiple application areas.
We propose a novel method based on deep learning for complex HAR in the real-scene.
The proposed method can surprisingly converge in a few iterations and achieve an accuracy of 91.15% on a real IMU dataset.
arXiv Detail & Related papers (2020-03-04T03:31:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.