COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
- URL: http://arxiv.org/abs/2503.07259v1
- Date: Mon, 10 Mar 2025 12:43:51 GMT
- Title: COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
- Authors: Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
- Abstract summary: We propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. Our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications.
- Score: 3.271109623410664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at https://github.com/Breezelled/COMODO .
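The abstract describes the core mechanism only at a high level: a frozen, pretrained video encoder produces teacher embeddings, a trainable IMU encoder produces student embeddings, and a dynamic instance queue is used to align the two feature distributions without labels. The PyTorch sketch below shows one plausible shape for such a queue-based distillation objective; the class name, embedding dimension, queue size, temperatures, and KL-divergence formulation are illustrative assumptions, not details taken from the paper or its code.

```python
# Minimal sketch of queue-based cross-modal (video -> IMU) distillation.
# Assumptions: encoder interfaces, dim/queue_size/temperatures, and the
# KL objective are illustrative choices, not confirmed COMODO settings.
import torch
import torch.nn.functional as F

class QueueDistiller(torch.nn.Module):
    def __init__(self, teacher, student, dim=256, queue_size=4096,
                 t_temp=0.05, s_temp=0.1):
        super().__init__()
        self.teacher = teacher.eval()          # frozen, pretrained video encoder
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.student = student                 # trainable IMU encoder
        self.t_temp, self.s_temp = t_temp, s_temp
        # dynamic instance queue holding recent teacher (video) embeddings
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _enqueue(self, keys):
        n = keys.shape[0]
        ptr = int(self.ptr)
        self.queue[ptr:ptr + n] = keys         # assumes queue_size % batch_size == 0
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, video, imu):
        with torch.no_grad():
            t = F.normalize(self.teacher(video), dim=1)   # teacher video embeddings
        s = F.normalize(self.student(imu), dim=1)         # student IMU embeddings
        # similarity distributions of each sample over the instance queue
        t_logits = t @ self.queue.T / self.t_temp
        s_logits = s @ self.queue.T / self.s_temp
        loss = F.kl_div(F.log_softmax(s_logits, dim=1),
                        F.softmax(t_logits, dim=1),
                        reduction="batchmean")
        self._enqueue(t)
        return loss
```

Only the student receives gradients; after each step the current teacher embeddings are pushed into the queue, so the distribution target evolves with the training data, which is one way to realize the "dynamic instance queue" described in the abstract.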
Related papers
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets [7.667819384855409]
We present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning.
By simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator.
Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning.
arXiv Detail & Related papers (2025-04-03T17:38:59Z)
- Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals [13.202236467650033]
Estimating motion in videos is an essential computer vision problem with many downstream applications.
We develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model.
We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.
arXiv Detail & Related papers (2025-03-25T17:58:52Z)
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
- Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition [24.217068565936117]
We present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video.
To model the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics.
Experiments show our method can achieve state-of-the-art performance on multiple public datasets.
arXiv Detail & Related papers (2024-07-09T07:53:16Z)
- Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs [9.570759294459629]
We propose Multi³Net, our novel multi-modal, multitask, and contrastive-based framework to address the issue of limited data.
Our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities.
arXiv Detail & Related papers (2024-06-03T13:28:42Z)
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization [56.75782714530429]
We propose a cross-modal adaptation framework, which we call X-MIC.
Our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space.
This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization.
arXiv Detail & Related papers (2024-03-28T19:45:35Z)
- AICL: Action In-Context Learning for Video Diffusion Model [124.39948693332552]
We propose AICL, which empowers the generative model with the ability to understand action information in reference videos.
Extensive experiments demonstrate that AICL effectively captures the action and achieves state-of-the-art generation performance.
arXiv Detail & Related papers (2024-03-18T07:41:19Z)
- MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition [47.29528724322795]
MAE-DFER is a novel self-supervised method for learning dynamic facial expressions.
It uses large-scale self-supervised pre-training on abundant unlabeled data.
It consistently outperforms state-of-the-art supervised methods.
arXiv Detail & Related papers (2023-07-05T12:08:56Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple, yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity; (iii) we demonstrate state-of-the-art results among the self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)