Related papers: MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

URL: http://arxiv.org/abs/2406.03857v1
Date: Thu, 6 Jun 2024 08:42:36 GMT
Title: MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Authors: Stefan Gerd Fritsch, Cennet Oguz, Vitor Fortes Rey, Lala Ray, Maximilian Kiefer-Emmanouilidis, Paul Lukowicz
Abstract summary: Human Activity Recognition is a longstanding problem in AI with applications in a broad range of areas. We show how we can improve HAR performance across different modalities using multimodal contrastive pretraining.
Score: 2.7532797256542403
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human Activity Recognition is a longstanding problem in AI with applications in a broad range of areas: from healthcare, sports and fitness, security, and human-computer interaction to robotics. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundational models (e.g., CLIP), can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. In this work, we show how we can improve HAR performance across different modalities using multimodal contrastive pretraining. Our approach, MuJo (Multimodal Joint Feature Space Learning), learns a multimodal joint feature space with video, language, pose, and IMU sensor data. The proposed approach combines contrastive and multitask learning methods and analyzes different multitasking strategies for learning a compact shared representation. A large dataset with parallel video, language, pose, and sensor data points is also introduced to support the research, along with an analysis of the robustness of the multimodal joint space for modal-incomplete and low-resource data. On the MM-Fit dataset, our model achieves an impressive Macro F1-Score of up to 0.992 with only 2% of the training data and 0.999 when using all available training data for classification tasks. Moreover, in the scenario where the MM-Fit dataset is unseen, we demonstrate a generalization performance of up to 0.638.
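
Illustrative sketch: the abstract does not spell out the contrastive objective, so the snippet below shows a generic CLIP-style symmetric InfoNCE loss between two modalities as one plausible way to pull paired embeddings together in a joint space. It is an assumption for illustration, not the authors' implementation; the function names (`l2_normalize`, `info_nce`), the `temperature` value, and the toy IMU/video vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    anchors[i] and positives[i] are embeddings of the same sample in two
    modalities (e.g. IMU and video); every other pairing in the batch is
    treated as a negative. The temperature value is an assumed default.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)

    # logits[i][j]: scaled cosine similarity between anchor i and positive j.
    logits = [
        [sum(x * y for x, y in zip(a[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the anchor-to-positive and positive-to-anchor directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy embeddings (hypothetical): two samples seen through two modalities.
imu = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(imu, video)
shuffled_loss = info_nce(imu, video[::-1])
print(aligned_loss < shuffled_loss)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the signal a contrastive pretraining stage would minimize.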

Related papers

PIM: Physics-Informed Multi-task Pre-training for Improving Inertial Sensor-Based Human Activity Recognition [4.503003860563811]
We propose a physics-informed multi-task pre-training (PIM) framework for IMU-based human activity recognition (HAR). PIM generates pre-text tasks based on the understanding of basic physical aspects of human motion. We have observed gains of almost 10% in macro F1 score and accuracy with only 2 to 8 labeled examples per class.
arXiv Detail & Related papers (2025-03-23T08:16:01Z)
Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs [9.570759294459629]
We propose Multi$^3$Net, our novel multi-modal, multitask, and contrastive-based framework approach to address the issue of limited data. Our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities.
arXiv Detail & Related papers (2024-06-03T13:28:42Z)
Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. We propose a novel approach to address this issue at test time without requiring retraining. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition [32.89577715124546]
We propose a novel unsupervised multimodal HAR solution, MaskFi, that leverages only unlabeled video and WiFi activity data for model training. Benefiting from our unsupervised learning procedure, the network requires only a small amount of annotated data for finetuning and can adapt to the new environment with better performance.
arXiv Detail & Related papers (2024-02-29T15:27:55Z)
Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space. We show that MTL can be successful with classification tasks with little, or non-overlapping annotations. We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
Contrastive Left-Right Wearable Sensors (IMUs) Consistency Matching for HAR [0.0]
We show how real data can be used for self-supervised learning without any transformations. Our approach involves contrastive matching of two different sensors. We test our approach on the Opportunity and MM-Fit datasets.
arXiv Detail & Related papers (2023-11-21T15:31:16Z)
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP). ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition [6.0306313759213275]
We propose a multi-modal framework that learns to effectively combine features from RGB Video and IMU sensors. Our model is trained in two stages, where in the first stage, each input encoder learns to effectively extract features. We show significant improvements of 22% and 11% compared to video only, and 20% and 12% on MMAct datasets.
arXiv Detail & Related papers (2022-11-08T15:48:44Z)
Progressive Cross-modal Knowledge Distillation for Human Action Recognition [10.269019492921306]
We propose a novel Progressive Skeleton-to-sensor Knowledge Distillation (PSKD) model for solving the wearable sensor-based HAR problem. Specifically, we construct multiple teacher models using data from both teacher (human skeleton sequence) and student (time-series accelerometer data) modalities.
arXiv Detail & Related papers (2022-08-17T06:06:03Z)
Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
Families In Wild Multimedia: A Multimodal Database for Recognizing Kinship [63.27052967981546]
We introduce the first publicly available multi-task MM kinship dataset. To build FIW MM, we developed machinery to automatically collect, annotate, and prepare the data. Results highlight edge cases to inspire future research with different areas of improvement.
arXiv Detail & Related papers (2020-07-28T22:36:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.