IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based
Human Activity Recognition
- URL: http://arxiv.org/abs/2402.01049v1
- Date: Thu, 1 Feb 2024 22:37:33 GMT
- Title: IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based
Human Activity Recognition
- Authors: Zikang Leng, Amitrajit Bhattacharjee, Hrudhai Rajasekhar, Lizhe Zhang,
Elizabeth Bruda, Hyeokhyen Kwon, Thomas Plötz
- Abstract summary: Cross modality transfer approaches convert existing datasets from a source modality, such as video, to a target modality (IMU).
We introduce two new extensions for IMUGPT that enhance its use for practical HAR application scenarios.
We demonstrate that our diversity metrics can reduce the effort needed for the generation of virtual IMU data by at least 50%.
- Score: 0.19791587637442667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the primary challenges in the field of human activity recognition
(HAR) is the lack of large labeled datasets. This hinders the development of
robust and generalizable models. Recently, cross modality transfer approaches
have been explored that can alleviate the problem of data scarcity. These
approaches convert existing datasets from a source modality, such as video, to
a target modality (IMU). With the emergence of generative AI models such as
large language models (LLMs) and text-driven motion synthesis models, language
has become a promising source data modality as well, as shown in proofs of
concept such as IMUGPT. In this work, we conduct a large-scale evaluation of
language-based cross modality transfer approaches to determine their effectiveness for
HAR. Based on this study, we introduce two new extensions for IMUGPT that
enhance its use for practical HAR application scenarios: a motion filter
capable of filtering out irrelevant motion sequences to ensure the relevance of
the generated virtual IMU data, and a set of metrics that measure the diversity
of the generated data, facilitating the determination of when to stop generating
virtual IMU data for both effective and efficient processing. We demonstrate
that our diversity metrics can reduce the effort needed for the generation of
virtual IMU data by at least 50%, which opens up IMUGPT for practical use cases
beyond a mere proof of concept.
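The listing above contains no code, so the following is only a rough sketch of the kind of stopping rule the diversity metrics enable: keep generating batches of virtual IMU sequences and stop once a simple diversity measure (here, the mean pairwise distance between sequence embeddings) stops improving. The embedding function, batch generator, and thresholds are illustrative assumptions, not the metrics actually proposed in the paper.

```python
import numpy as np


def mean_pairwise_distance(features: np.ndarray) -> float:
    """Mean Euclidean distance over all unique pairs of rows.

    `features` has shape (n_sequences, feature_dim); each row is assumed to be
    an embedding of one generated motion sequence (e.g. pooled joint positions
    or the output of a learned encoder).
    """
    n = len(features)
    if n < 2:
        return 0.0
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return float(dists[np.triu_indices(n, k=1)].mean())


def should_stop(history: list, patience: int = 3, min_gain: float = 0.01) -> bool:
    """Stop once diversity has improved by less than `min_gain` (relative)
    over the last `patience` batches."""
    if len(history) <= patience:
        return False
    recent, earlier = history[-1], history[-1 - patience]
    return (recent - earlier) / max(earlier, 1e-8) < min_gain


def generate_until_diverse(generate_batch, embed, max_batches: int = 100):
    """Hypothetical loop: generate_batch() yields virtual IMU sequences
    (e.g. text -> motion -> IMU) and embed() maps a sequence to a 1-D vector."""
    embeddings, history = [], []
    for _ in range(max_batches):
        embeddings.extend(embed(seq) for seq in generate_batch())
        history.append(mean_pairwise_distance(np.stack(embeddings)))
        if should_stop(history):
            break
    return embeddings, history
```

In practice the diversity would be measured in the same feature space used by the downstream HAR model, so that a saturating measure indicates that further virtual IMU data is unlikely to improve training.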
Related papers
- Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training.
The capacity to generalize effectively on smaller datasets remains a persistent challenge.
We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z)
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs [9.570759294459629]
We propose Multi$^3$Net, a novel multi-modal, multitask, and contrastive-based framework to address the issue of limited data.
Our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities.
arXiv Detail & Related papers (2024-06-03T13:28:42Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation [56.61543110071199]
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists of adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset.
Previous approaches have attempted to address SFVUDA by leveraging self-supervision derived from the target data itself.
We instead exploit "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift.
arXiv Detail & Related papers (2023-08-17T18:12:05Z)
- Generating Virtual On-body Accelerometer Data from Virtual Textual Descriptions for Human Activity Recognition [0.6445605125467573]
We introduce an automated pipeline that generates 3D human motion sequences via a motion synthesis model, T2M-GPT, which are later converted to streams of virtual IMU data (a rough sketch of this motion-to-IMU conversion appears after this list).
We benchmark our approach on three HAR datasets (RealWorld, PAMAP2, and USC-HAD) and demonstrate that virtual IMU training data generated with our new approach leads to significantly improved HAR model performance.
arXiv Detail & Related papers (2023-05-04T22:14:44Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real, practical instance-level retrieval tasks.
We then train a more effective cross-modal pretraining model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- IMUTube: Automatic Extraction of Virtual on-body Accelerometry from Video for Human Activity Recognition [12.91206329972949]
We introduce IMUTube, an automated processing pipeline to convert videos of human activity into virtual streams of IMU data.
These virtual IMU streams represent accelerometry at a wide variety of locations on the human body.
We show how the virtually-generated IMU data improves the performance of a variety of models on known HAR datasets.
arXiv Detail & Related papers (2020-05-29T21:50:38Z)
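Both the language-driven pipeline above (IMUGPT) and IMUTube ultimately derive virtual accelerometry from 3D motion. Neither implementation is reproduced here; the sketch below only illustrates the basic idea under simplifying assumptions: approximate the linear acceleration of a chosen body location by the second finite difference of its 3D position and add gravity in the global frame. Real pipelines additionally rotate the signal into the moving sensor frame and model calibration and noise.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # m/s^2 in the global frame (z-up assumed)


def virtual_accelerometer(positions: np.ndarray, fps: float) -> np.ndarray:
    """Approximate tri-axial accelerometer readings at a virtual sensor.

    `positions` has shape (T, 3): the 3D trajectory, in metres, of the body
    location where the virtual sensor is "worn", sampled at `fps` Hz.
    Linear acceleration is estimated with a central second difference; the
    result is the specific force (acceleration minus gravity), which is what
    a real accelerometer measures.
    """
    dt = 1.0 / fps
    accel = np.zeros_like(positions)
    accel[1:-1] = (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
    accel[0], accel[-1] = accel[1], accel[-2]  # pad the boundary samples
    return accel - GRAVITY  # at rest this reads +9.81 m/s^2 along +z


# Example: a synthetic wrist trajectory sampled at roughly 20 Hz yields a
# (T, 3) stream of virtual accelerometer values that could be fed to a HAR model.
if __name__ == "__main__":
    t = np.linspace(0.0, 2.0, 40)[:, None]  # ~2 s at ~20 Hz
    wrist = np.hstack([0.3 * np.sin(2 * np.pi * t), 0.0 * t, 1.0 + 0.0 * t])
    print(virtual_accelerometer(wrist, fps=20.0).shape)  # (40, 3)
```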