MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
- URL: http://arxiv.org/abs/2406.03857v3
- Date: Thu, 06 Feb 2025 12:37:09 GMT
- Title: MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
- Authors: Stefan Gerd Fritsch, Cennet Oguz, Vitor Fortes Rey, Lala Ray, Maximilian Kiefer-Emmanouilidis, Paul Lukowicz
- Abstract summary: Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas.
We introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) to enhance HAR performance across various modalities.
We show that pre-training on FiMAD can increase performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH.
- Score: 2.7532797256542403
- Abstract: Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas, including healthcare, sports and fitness, security, and more. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundation models, can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. To alleviate the need for labeled data, we introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) in this work, which can be used with the proposed pre-training method MuJo (Multimodal Joint Feature Space Learning) to enhance HAR performance across various modalities. FiMAD was created using YouTube fitness videos and contains parallel video, language, pose, and simulated IMU sensor data. MuJo utilizes this dataset to learn a joint feature space for these modalities. We show that classifiers pre-trained on FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on MM-Fit, we achieve a Macro F1-Score of up to 0.855 when fine-tuning on only 2% of the training data and 0.942 when utilizing the complete training set for classification tasks. We compare our approach with other self-supervised ones and show that, unlike them, ours consistently improves compared to the baseline network performance while also providing better data efficiency.
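The abstract does not spell out MuJo's training objective in detail; as a rough, hedged illustration of multimodal joint feature space learning, the sketch below assumes a CLIP-style symmetric contrastive (InfoNCE) alignment between two modality encoders projected into a shared space. Encoder names, dimensions, and the loss choice are assumptions, not the paper's actual design.

```python
# Minimal sketch of joint feature space learning across two modalities, assuming a
# CLIP-style symmetric InfoNCE objective; encoders, dimensions, and the loss are
# hypothetical placeholders, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a modality-specific embedding into the shared joint space."""
    def __init__(self, in_dim: int, joint_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, joint_dim), nn.ReLU(),
                                  nn.Linear(joint_dim, joint_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)  # unit-norm joint-space features

def symmetric_info_nce(z_a, z_b, temperature: float = 0.07):
    """Contrastive alignment of two batches of paired modality features."""
    logits = z_a @ z_b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with pre-computed per-modality embeddings (e.g., simulated IMU and text encoders):
imu_feats, text_feats = torch.randn(32, 128), torch.randn(32, 512)
imu_head, text_head = ProjectionHead(128), ProjectionHead(512)
loss = symmetric_info_nce(imu_head(imu_feats), text_head(text_feats))
```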
Related papers
- Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs [9.570759294459629]
We propose Multi$3$Net, a novel multi-modal, multitask, and contrastive-based framework to address the issue of limited data.
Our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities.
arXiv Detail & Related papers (2024-06-03T13:28:42Z)
- MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition [32.89577715124546]
We propose a novel unsupervised multimodal HAR solution, MaskFi, that leverages only unlabeled video and WiFi activity data for model training.
Benefiting from our unsupervised learning procedure, the network requires only a small amount of annotated data for fine-tuning and can adapt to new environments with better performance.
arXiv Detail & Related papers (2024-02-29T15:27:55Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework in which multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can succeed with classification tasks whose annotations overlap only partially or not at all.
We propose a novel approach in which knowledge exchange between the tasks is enabled via distribution matching.
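The summary does not specify how the distributions are matched; a minimal sketch, assuming for simplicity that two task heads share a label space and are aligned by a KL term on softened predictions (an illustrative stand-in, not the paper's exact formulation):

```python
# Illustrative distribution matching between two task heads on shared inputs;
# assumes (for simplicity) a shared label space, which is NOT stated in the paper.
import torch
import torch.nn.functional as F

def distribution_matching_loss(logits_a, logits_b, temperature: float = 2.0):
    """KL-based consistency between softened predictions of two task heads."""
    log_p_a = F.log_softmax(logits_a / temperature, dim=-1)
    p_b = F.softmax(logits_b / temperature, dim=-1)
    return F.kl_div(log_p_a, p_b, reduction="batchmean") * temperature ** 2

# Combined with the per-task supervised losses, e.g.:
# total = ce_loss_a + ce_loss_b + lambda_match * distribution_matching_loss(logits_a, logits_b)
logits_a, logits_b = torch.randn(8, 5), torch.randn(8, 5)
match_term = distribution_matching_loss(logits_a, logits_b)
```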
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Contrastive Left-Right Wearable Sensors (IMUs) Consistency Matching for HAR [0.0]
We show how real data can be used for self-supervised learning without any transformations.
Our approach involves contrastive matching of two different sensors.
We test our approach on the Opportunity and MM-Fit datasets.
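For intuition, a minimal sketch of contrastive consistency matching between two simultaneously recorded sensor streams (here, hypothetical left- and right-wrist IMU windows); the encoder architecture and window shapes are illustrative assumptions, not the paper's setup.

```python
# Sketch of contrastive matching between two simultaneously recorded IMU streams
# (e.g., left and right wrist); encoder and window shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    def __init__(self, channels: int = 6, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        return F.normalize(self.net(x), dim=-1)

def consistency_loss(z_left, z_right, temperature: float = 0.1):
    """Windows recorded at the same time on both sensors are treated as positives."""
    logits = z_left @ z_right.t() / temperature
    targets = torch.arange(z_left.size(0), device=z_left.device)
    return F.cross_entropy(logits, targets)

encoder = IMUEncoder()
left, right = torch.randn(16, 6, 100), torch.randn(16, 6, 100)  # synthetic windows
loss = consistency_loss(encoder(left), encoder(right))
```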
arXiv Detail & Related papers (2023-11-21T15:31:16Z)
- Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose ADA-CM, an efficient Adaptive HOI Detector with Concept-guided Memory.
ADA-CM has two operating modes; the first makes it usable without learning any new parameters, in a training-free paradigm.
Our method achieves results competitive with the state of the art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
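As a rough illustration of the inverse dynamics component only: the sketch below predicts the action taken between two consecutive observations from their embeddings. The encoder, discrete action space, and the way this couples to the RL policy are assumptions, not ALP's actual implementation.

```python
# Sketch of an inverse dynamics objective: predict the action that connects two
# consecutive observations from their learned embeddings. Shapes and the discrete
# action space are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsHead(nn.Module):
    def __init__(self, embed_dim: int = 256, num_actions: int = 6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_actions))

    def forward(self, z_t, z_next):
        return self.mlp(torch.cat([z_t, z_next], dim=-1))  # action logits

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))  # toy visual encoder
head = InverseDynamicsHead()
obs_t, obs_next = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
actions = torch.randint(0, 6, (8,))                                  # actions actually taken
loss = F.cross_entropy(head(encoder(obs_t), encoder(obs_next)), actions)
# In ALP, this kind of auxiliary objective is reported to be combined with an RL policy.
```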
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition [6.0306313759213275]
We propose a multi-modal framework that learns to effectively combine features from RGB Video and IMU sensors.
Our model is trained in two stages: in the first stage, each input encoder learns to extract features from its modality effectively.
We show significant improvements of 22% and 11% compared to video only, and of 20% and 12% on the MMAct dataset.
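A minimal sketch of a two-stage scheme of this kind, assuming the unimodal encoders from stage one are frozen and a simple concatenation-based fusion head is trained in stage two; module names and the fusion choice are illustrative, not the paper's exact design.

```python
# Sketch of two-stage multimodal training: (1) pretrain each unimodal encoder,
# (2) freeze them and train a fusion head on concatenated features.
import torch
import torch.nn as nn

video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stage-1 pretrained (toy)
imu_encoder = nn.Sequential(nn.Flatten(), nn.Linear(6 * 100, 128))        # stage-1 pretrained (toy)

class FusionClassifier(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(128 + 128, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))

    def forward(self, video_feat, imu_feat):
        return self.head(torch.cat([video_feat, imu_feat], dim=-1))

# Stage 2: freeze the encoders and train only the fusion head.
for module in (video_encoder, imu_encoder):
    for p in module.parameters():
        p.requires_grad = False

model = FusionClassifier()
logits = model(video_encoder(torch.randn(4, 3, 32, 32)), imu_encoder(torch.randn(4, 6, 100)))
```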
arXiv Detail & Related papers (2022-11-08T15:48:44Z)
- Progressive Cross-modal Knowledge Distillation for Human Action Recognition [10.269019492921306]
We propose a novel Progressive Skeleton-to-sensor Knowledge Distillation (PSKD) model for solving the wearable sensor-based HAR problem.
Specifically, we construct multiple teacher models using data from both teacher (human skeleton sequence) and student (time-series accelerometer data) modalities.
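A minimal sketch of skeleton-to-sensor knowledge distillation with a single frozen teacher and softened-prediction matching; the progressive, multi-teacher aspect of PSKD is not reproduced, and all architectures and temperatures are illustrative assumptions.

```python
# Sketch of cross-modal knowledge distillation: a skeleton-based teacher's soft
# predictions supervise an accelerometer-based student. Single teacher only;
# architectures and the temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(25 * 3 * 50, 10))  # toy skeleton classifier (frozen)
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 50, 10))       # toy accelerometer classifier

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

skeleton, accel = torch.randn(8, 25, 3, 50), torch.randn(8, 3, 50)  # paired modalities
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(skeleton)
loss = distillation_loss(student(accel), t_logits, labels)
```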
arXiv Detail & Related papers (2022-08-17T06:06:03Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose MRG-Net, a novel online multi-modal graph network that dynamically integrates visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
- Families In Wild Multimedia: A Multimodal Database for Recognizing Kinship [63.27052967981546]
We introduce the first publicly available multi-task MM kinship dataset.
To build FIW MM, we developed machinery to automatically collect, annotate, and prepare the data.
Results highlight edge cases that point to areas of improvement for future research.
arXiv Detail & Related papers (2020-07-28T22:36:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.