Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities
- URL: http://arxiv.org/abs/2504.08578v1
- Date: Fri, 11 Apr 2025 14:30:42 GMT
- Title: Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities
- Authors: Maria Santos-Villafranca, Dustin Carrión-Ojeda, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer
- Abstract summary: We introduce an efficient multimodal knowledge distillation approach for egocentric action recognition. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model.
- Score: 43.15852057358654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition is an essential task in egocentric vision due to its wide range of applications across many fields. While deep learning methods have been proposed to address this task, most rely on a single modality, typically video. However, including additional modalities may improve the robustness of the approaches to common issues in egocentric videos, such as blurriness and occlusions. Recent efforts in multimodal egocentric action recognition often assume the availability of all modalities, leading to failures or performance drops when any modality is missing. To address this, we introduce an efficient multimodal knowledge distillation approach for egocentric action recognition that is robust to missing modalities (KARMMA) while still benefiting when multiple modalities are available. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model, which distills knowledge into a much smaller and faster student model. Experiments on the Epic-Kitchens and Something-Something datasets demonstrate that our student model effectively handles missing modalities while reducing its accuracy drop in this scenario.
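As a rough illustration of the teacher-to-student transfer described in the abstract, the sketch below shows a generic temperature-scaled distillation loss together with a fusion step that tolerates missing modalities. It is not the KARMMA method itself: the abstract does not specify the loss or the fusion, so the KL-based loss, the averaging fusion, the modality names `video` and `audio`, and the function names are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation; assumed here, not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def fuse_available(modality_logits):
    """Average per-class logits over the modalities that are present.

    A value of None marks a missing modality (hypothetical convention).
    """
    present = [l for l in modality_logits.values() if l is not None]
    if not present:
        raise ValueError("at least one modality is required")
    n = len(present)
    return [sum(vals) / n for vals in zip(*present)]


# Teacher sees every modality; the student here only receives video.
teacher = fuse_available({"video": [2.0, 0.5, -1.0], "audio": [1.0, 1.5, -0.5]})
student = fuse_available({"video": [1.8, 0.2, -0.7], "audio": None})
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise, so minimizing it pushes a student that sees fewer modalities towards the predictions of the full multimodal teacher.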
Related papers
- MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment [25.542507946327333]
We propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. Our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public benchmarks.
arXiv Detail & Related papers (2025-11-21T16:56:25Z) - Cross-Modal Distillation For Widely Differing Modalities [31.049823782188437]
We conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. This knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. We propose two soft constrained knowledge distillation strategies at the feature level and a quality-based adaptive weights module to weigh input samples.
arXiv Detail & Related papers (2025-07-22T07:34:00Z) - Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities [3.88369051454137]
Multimodal learning enhances deep learning models by enabling them to perceive and understand information from multiple data modalities. Most existing approaches assume the availability of all modalities, an assumption that often fails in real-world applications. Recent works have introduced learnable missing-case-aware prompts to mitigate performance degradation caused by missing modalities. We propose a novel decoupled prototype-based output head, which leverages missing-case-aware class-wise prototypes tailored for each individual modality.
arXiv Detail & Related papers (2025-05-13T06:53:37Z) - Modality-Balanced Learning for Multimedia Recommendation [21.772064939915214]
We propose a Counterfactual Knowledge Distillation method to solve the imbalance problem and make the best use of all modalities.
We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers.
Our method could serve as a plug-and-play module for both late-fusion and early-fusion backbones.
arXiv Detail & Related papers (2024-07-26T07:53:01Z) - Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z) - XTrack: Multimodal Training Boosts RGB-X Video Object Trackers [88.72203975896558]
It is crucial to ensure that knowledge gained from multimodal sensing is effectively shared. Similar samples across different modalities have more knowledge to share than otherwise. We propose a method for RGB-X tracker during inference, with an average +3% precision improvement over the current SOTA.
arXiv Detail & Related papers (2024-05-28T03:00:58Z) - Test-Time Adaptation for Combating Missing Modalities in Egocentric Videos [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. We propose a novel approach to address this issue at test time without requiring retraining. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z) - Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning [12.00246872965739]
We propose a novel dynamic self-adaptive multiscale distillation from pre-trained multimodal large model.
Our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model.
Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information.
arXiv Detail & Related papers (2024-04-16T18:22:49Z) - Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing it from its original $\sim 30\%$ drop to only $\sim 10\%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning [6.379202839994046]
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion.
We propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a specific modal fundamental model.
We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
arXiv Detail & Related papers (2023-09-27T08:44:04Z) - Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision [40.16465314639641]
Self-supervised representation learning for human action recognition has developed rapidly in recent years.
Most of the existing works are based on skeleton data while using a multi-modality setup.
We first propose an Implicit Knowledge Exchange Module which alleviates the propagation of erroneous knowledge between low-performance modalities.
arXiv Detail & Related papers (2023-09-21T12:27:43Z) - Multimodal Distillation for Egocentric Action Recognition [41.821485757189656]
Egocentric video understanding involves modelling hand-object interactions.
Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well.
But their performance improves further by employing additional input modalities that provide complementary cues.
The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time.
arXiv Detail & Related papers (2023-07-14T17:07:32Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal Human Activity Recognition [1.869225486385596]
We explore the hypothesis that leveraging multiple modalities can lead to better recognition.
We extend a number of recent contrastive self-supervised approaches for the task of Human Activity Recognition.
We propose a flexible, general-purpose framework for performing multimodal self-supervised learning.
arXiv Detail & Related papers (2022-05-20T10:39:16Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.