Students taught by multimodal teachers are superior action recognizers
- URL: http://arxiv.org/abs/2210.04331v1
- Date: Sun, 9 Oct 2022 19:37:17 GMT
- Title: Students taught by multimodal teachers are superior action recognizers
- Authors: Gorjan Radevski, Dusan Grujicic, Matthew Blaschko, Marie-Francine
Moens, Tinne Tuytelaars
- Abstract summary: The focal point of egocentric video understanding is modelling hand-object interactions.
Standard models -- CNNs, Vision Transformers, etc. -- which receive RGB frames as input perform well; however, their performance improves further when additional modalities such as object detections, optical flow, and audio are used as input.
The goal of this work is to retain the performance of such multimodal approaches, while using only the RGB images as input at inference time.
- Score: 41.821485757189656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The focal point of egocentric video understanding is modelling hand-object
interactions. Standard models -- CNNs, Vision Transformers, etc. -- which
receive RGB frames as input perform well; however, their performance improves
further when additional modalities such as object detections, optical flow, and
audio are used as input. The added complexity of the required
modality-specific modules, on the other hand, makes these models impractical
for deployment. The goal of this work is to retain the performance of such
multimodal approaches, while using only the RGB images as input at inference
time. Our approach is based on multimodal knowledge distillation, featuring a
multimodal teacher (in the current experiments trained only using object
detections, optical flow and RGB frames) and a unimodal student (using only RGB
frames as input). We present preliminary results which demonstrate that the
resulting model -- distilled from a multimodal teacher -- significantly
outperforms the baseline RGB model (trained without knowledge distillation), as
well as an omnivorous version of itself (trained on all modalities jointly), in
both standard and compositional action recognition.
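To make the setup concrete, below is a minimal PyTorch sketch of the kind of multimodal-teacher / RGB-only-student distillation described in the abstract. It is an illustration under stated assumptions, not the authors' released code: the encoder modules are placeholder linear layers, and the KL-divergence on temperature-softened logits combined with a weighted cross-entropy term is a standard distillation objective assumed here, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalTeacher(nn.Module):
    """Hypothetical teacher fusing RGB, optical-flow and object-detection features."""
    def __init__(self, rgb_dim, flow_dim, det_dim, feat_dim, num_classes):
        super().__init__()
        # Placeholder encoders; in practice these would be CNN / ViT backbones.
        self.rgb_enc = nn.Linear(rgb_dim, feat_dim)
        self.flow_enc = nn.Linear(flow_dim, feat_dim)
        self.det_enc = nn.Linear(det_dim, feat_dim)
        self.head = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, rgb, flow, det):
        fused = torch.cat(
            [self.rgb_enc(rgb), self.flow_enc(flow), self.det_enc(det)], dim=-1
        )
        return self.head(fused)

class RGBStudent(nn.Module):
    """Unimodal student: sees only RGB features, at training and at inference."""
    def __init__(self, rgb_dim, feat_dim, num_classes):
        super().__init__()
        self.rgb_enc = nn.Linear(rgb_dim, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb):
        return self.head(self.rgb_enc(rgb))

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Soft targets from the frozen multimodal teacher plus hard cross-entropy on labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

At inference only `RGBStudent` is executed, so the flow and detection branches (and the pre-processing they require) are dropped entirely, which is what makes the distilled model practical to deploy.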
Related papers
- Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition [12.382193259575805]
We propose a novel multi-modality co-learning (MMCL) framework for efficient skeleton-based action recognition.
Our MMCL framework engages in multi-modality co-learning during the training stage and remains efficient by using only concise skeletons at inference.
arXiv Detail & Related papers (2024-07-22T15:16:47Z)
- Towards a Generalist and Blind RGB-X Tracker [91.36268768952755]
We develop a single-model tracker that remains blind to any modality X at inference time.
Our training process is extremely simple, integrating multi-label classification loss with a routing function.
Our generalist and blind tracker can achieve competitive performance compared to well-established modal-specific models.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection [12.462709547836289]
Using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD).
In this paper, we explore a different way to employ the RGB and IR modalities, where a single shared vision encoder observes only one modality or the other.
This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder, while countering the effects of modality imbalance.
arXiv Detail & Related papers (2024-04-29T16:42:58Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Multimodal Distillation for Egocentric Action Recognition [41.821485757189656]
Egocentric video understanding involves modelling hand-object interactions.
Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input, perform well.
However, their performance improves further by employing additional input modalities that provide complementary cues.
The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time.
arXiv Detail & Related papers (2023-07-14T17:07:32Z)
- CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for the RGB and depth modalities, termed CoMAE.
Our CoMAE presents a curriculum learning strategy to unify two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling (a generic sketch combining these two objectives appears after this list).
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
- Unified Object Detector for Different Modalities based on Vision Transformers [1.14219428942199]
We develop a unified detector that achieves superior performance across diverse modalities.
Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors.
We evaluate our unified model on the SUN RGB-D dataset and demonstrate that it achieves similar or better performance in terms of mAP50.
arXiv Detail & Related papers (2022-07-03T16:01:04Z)
- Mutual Modality Learning for Video Action Classification [74.83718206963579]
We show how to embed multi-modality into a single model for video action classification.
We achieve state-of-the-art results on the Something-Something-v2 benchmark.
arXiv Detail & Related papers (2020-11-04T21:20:08Z)
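As referenced in the CoMAE entry above, that blurb names two standard self-supervised objectives, contrastive learning and masked image modeling, that a curriculum is said to unify. The sketch below is a generic, hedged illustration of those two losses and a toy two-stage schedule for paired RGB-D features; it is not the CoMAE implementation, and the embeddings, masking scheme, and schedule are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_rgb, z_depth, temperature=0.07):
    """Generic cross-modal contrastive (InfoNCE) loss between paired RGB / depth embeddings."""
    z_rgb = F.normalize(z_rgb, dim=-1)
    z_depth = F.normalize(z_depth, dim=-1)
    logits = z_rgb @ z_depth.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_rgb.size(0), device=z_rgb.device)
    return F.cross_entropy(logits, targets)

def masked_recon_loss(pred_patches, target_patches, mask):
    """Masked-image-modeling loss: mean squared error over the masked patches only."""
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def curriculum_loss(epoch, contrastive_epochs, z_rgb, z_depth,
                    pred_patches, target_patches, mask):
    """Toy two-stage curriculum: contrastive warm-up, then masked modeling."""
    if epoch < contrastive_epochs:
        return info_nce(z_rgb, z_depth)
    return masked_recon_loss(pred_patches, target_patches, mask)
```

This is only meant to make the terminology in the list concrete; the actual papers define their own encoders, schedules, and loss weightings.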
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.