Mutual Modality Learning for Video Action Classification
- URL: http://arxiv.org/abs/2011.02543v1
- Date: Wed, 4 Nov 2020 21:20:08 GMT
- Title: Mutual Modality Learning for Video Action Classification
- Authors: Stepan Komkov, Maksim Dzabraev, Aleksandr Petiushko
- Abstract summary: We show how to embed multi-modality into a single model for video action classification.
We achieve state-of-the-art results on the Something-Something-v2 benchmark.
- Score: 74.83718206963579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models for video action classification are advancing rapidly.
However, their performance can still be easily improved by ensembling them with
the same models trained on different modalities (e.g., optical flow).
Unfortunately, using several modalities during inference is computationally
expensive. Recent works examine ways to integrate the advantages of
multi-modality into a single RGB model, yet there is still room for
improvement. In this paper, we explore various methods to embed the ensemble's
power into a single model. We show that proper initialization, as well as
mutual modality learning, enhances single-modality models. As a result, we
achieve state-of-the-art results on the Something-Something-v2 benchmark.
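To make the idea concrete, below is a minimal, hedged sketch of mutual learning between an RGB model and an optical-flow model, in the spirit of deep mutual learning. The exact losses, weighting, initialization scheme, and training schedule used in the paper are not reproduced here; `rgb_model`, `flow_model`, and `lambda_mutual` are illustrative placeholders, not the authors' implementation.

```python
# Hedged sketch: joint training of an RGB model and an optical-flow model,
# where each single-modality model learns from the labels and from the other
# modality's softened predictions. Names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def mutual_modality_step(rgb_model, flow_model, rgb_clip, flow_clip, labels,
                         opt_rgb, opt_flow, lambda_mutual=1.0):
    rgb_logits = rgb_model(rgb_clip)      # (B, num_classes)
    flow_logits = flow_model(flow_clip)   # (B, num_classes)

    # Supervised losses on the ground-truth action labels.
    ce_rgb = F.cross_entropy(rgb_logits, labels)
    ce_flow = F.cross_entropy(flow_logits, labels)

    # Mutual losses: each model matches the other's class distribution
    # (detached, so each KL term only updates the "student" side).
    kl_rgb = F.kl_div(F.log_softmax(rgb_logits, dim=1),
                      F.softmax(flow_logits.detach(), dim=1),
                      reduction="batchmean")
    kl_flow = F.kl_div(F.log_softmax(flow_logits, dim=1),
                       F.softmax(rgb_logits.detach(), dim=1),
                       reduction="batchmean")

    loss = (ce_rgb + lambda_mutual * kl_rgb) + (ce_flow + lambda_mutual * kl_flow)

    opt_rgb.zero_grad()
    opt_flow.zero_grad()
    loss.backward()
    opt_rgb.step()
    opt_flow.step()
    return loss.item()
```

In such a setup only the RGB branch would be kept at inference time, so the single RGB model retains part of the ensemble's knowledge; the "proper initialization" mentioned in the abstract (e.g., starting each branch from pretrained weights) is likewise only hinted at here as an assumption.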
Related papers
- Exploring Model Kinship for Merging Large Language Models [52.01652098827454]
We introduce model kinship, the degree of similarity or relatedness between Large Language Models.
We find that there is a certain relationship between model kinship and the performance gains after model merging.
We propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets.
arXiv Detail & Related papers (2024-10-16T14:29:29Z) - Fine-Grained Scene Image Classification with Modality-Agnostic Adapter [8.801601759337006]
We present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter).
We eliminate modality differences in distribution and then use a modality-agnostic Transformer encoder for semantic-level feature fusion.
Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods.
arXiv Detail & Related papers (2024-07-03T02:57:14Z) - Mutual Learning for Finetuning Click-Through Rate Prediction Models [0.0]
In this paper, we show how useful the mutual learning algorithm can be when it is applied between equal models.
In our experiments on the Criteo and Avazu datasets, mutual learning yielded up to a 0.66% relative improvement in model performance.
arXiv Detail & Related papers (2024-06-17T20:56:30Z) - Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
arXiv Detail & Related papers (2023-07-30T09:48:36Z) - Multimodal Distillation for Egocentric Action Recognition [41.821485757189656]
Egocentric video understanding involves modelling hand-object interactions.
Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input, perform well.
However, their performance improves further when additional input modalities provide complementary cues.
The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time.
arXiv Detail & Related papers (2023-07-14T17:07:32Z) - Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Towards Good Practices for Missing Modality Robust Action Recognition [20.26021126604409]
This paper seeks a set of good practices for multi-modal action recognition.
First, we study how to effectively regularize the model during training.
Second, we investigate fusion methods for robustness to missing modalities.
Third, we propose a simple modular network, ActionMAE, which learns missing-modality predictive coding.
arXiv Detail & Related papers (2022-11-25T06:10:57Z)
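As a rough illustration of the missing-modality idea in the last entry above, the sketch below randomly drops one modality's feature vector during training and reconstructs it from the remaining modality, so that classification still works when a modality is absent at test time. The module names, dimensions, and losses are assumptions for illustration, not ActionMAE's actual architecture.

```python
# Hedged sketch of missing-modality predictive coding in the spirit of
# ActionMAE: drop one modality's features, reconstruct them from the other,
# then classify from the (possibly reconstructed) pair. All names, shapes,
# and losses here are illustrative assumptions.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MissingModalityClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        # Small decoder that predicts the dropped modality's features from
        # the kept one (shared for both directions in this sketch).
        self.reconstruct = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb_feat, flow_feat, drop_prob=0.5):
        recon_loss = rgb_feat.new_zeros(())
        if self.training and random.random() < drop_prob:
            if random.random() < 0.5:
                # Pretend the RGB features are missing and predict them.
                pred = self.reconstruct(flow_feat)
                recon_loss = F.mse_loss(pred, rgb_feat.detach())
                rgb_feat = pred
            else:
                # Pretend the flow features are missing and predict them.
                pred = self.reconstruct(rgb_feat)
                recon_loss = F.mse_loss(pred, flow_feat.detach())
                flow_feat = pred
        logits = self.classifier(torch.cat([rgb_feat, flow_feat], dim=1))
        return logits, recon_loss
```

In training, `recon_loss` would be added to the classification loss; at test time, a genuinely missing modality's features could be replaced by the reconstruction from the available one.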
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.