AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
- URL: http://arxiv.org/abs/2105.05165v2
- Date: Wed, 12 May 2021 17:49:10 GMT
- Title: AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
- Authors: Rameswar Panda, Chun-Fu Chen, Quanfu Fan, Ximeng Sun, Kate Saenko,
Aude Oliva, Rogerio Feris
- Abstract summary: We propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition.
We show that our proposed approach yields a 35%-55% reduction in computation compared to a traditional baseline that uses all modalities regardless of the input.
- Score: 61.51188561808917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal learning, which focuses on utilizing various modalities to
improve the performance of a model, is widely used in video recognition. While
traditional multi-modal learning offers excellent recognition results, its
computational expense limits its impact for many real-world applications. In
this paper, we propose an adaptive multi-modal learning framework, called
AdaMML, that selects on-the-fly the optimal modalities for each segment
conditioned on the input for efficient video recognition. Specifically, given a
video segment, a multi-modal policy network is used to decide what modalities
should be used for processing by the recognition model, with the goal of
improving both accuracy and efficiency. We efficiently train the policy network
jointly with the recognition model using standard back-propagation. Extensive
experiments on four challenging diverse datasets demonstrate that our proposed
adaptive approach yields 35%-55% reduction in computation when compared to the
traditional baseline that simply uses all the modalities irrespective of the
input, while also achieving consistent improvements in accuracy over the
state-of-the-art methods.
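To make the mechanism concrete, below is a minimal sketch of per-segment modality selection in PyTorch. It assumes one recognition subnet per modality, a lightweight policy head, and a straight-through Gumbel-Softmax relaxation so the discrete keep/skip decisions stay trainable with standard back-propagation; the class names, feature dimensions, and the relaxation itself are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of adaptive modality selection (not the official AdaMML code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Predicts, for each video segment, which modalities to process."""

    def __init__(self, feat_dim: int, num_modalities: int):
        super().__init__()
        self.num_modalities = num_modalities
        # Two logits (skip / keep) per modality.
        self.head = nn.Linear(feat_dim, num_modalities * 2)

    def forward(self, cheap_feat: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(cheap_feat).view(-1, self.num_modalities, 2)
        # Straight-through Gumbel-Softmax keeps the binary choice differentiable.
        decisions = F.gumbel_softmax(logits, tau=tau, hard=True)
        return decisions[..., 1]  # (batch, num_modalities), entries in {0, 1}


class AdaptiveMultiModalNet(nn.Module):
    def __init__(self, subnets: nn.ModuleList, feat_dim: int, num_classes: int):
        super().__init__()
        self.subnets = subnets  # one recognition subnet per modality
        self.policy = PolicyNet(feat_dim, len(subnets))
        self.classifier = nn.Linear(feat_dim * len(subnets), num_classes)

    def forward(self, cheap_feat, modality_inputs):
        gate = self.policy(cheap_feat)  # (batch, num_modalities)
        feats = []
        for m, (net, x) in enumerate(zip(self.subnets, modality_inputs)):
            # During training the gated-out features are zeroed; at inference the
            # subnet itself can simply be skipped whenever gate[:, m] == 0.
            feats.append(net(x) * gate[:, m : m + 1])
        return self.classifier(torch.cat(feats, dim=1)), gate
```

A joint training objective would then combine the recognition loss with a usage penalty, e.g. `F.cross_entropy(logits, labels) + lam * gate.mean()`, nudging the policy toward cheaper modality subsets while preserving accuracy; the exact loss and relaxation used by AdaMML are not specified in the abstract above.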
Related papers
- MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection [10.909746391230206]
Multimodal learning seeks to combine data from multiple input sources to enhance the performance of downstream tasks.
Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination.
We propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario.
arXiv Detail & Related papers (2024-10-03T21:41:12Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning [6.379202839994046]
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion.
We propose VideoAdviser, a video knowledge distillation method that transfers multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a modality-specific fundamental model.
We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
arXiv Detail & Related papers (2023-09-27T08:44:04Z)
- Dynamic Network Quantization for Efficient Video Inference [60.109250720206425]
We propose a dynamic network quantization framework that selects the optimal precision for each frame conditioned on the input for efficient video recognition.
We train both networks effectively using standard backpropagation with a loss designed to achieve both competitive performance and resource efficiency.
arXiv Detail & Related papers (2021-08-23T20:23:57Z)
- HMS: Hierarchical Modality Selection for Efficient Video Recognition [69.2263841472746]
This paper introduces Hierarchical Modality Selection (HMS), a simple yet effective multimodal learning framework for efficient video recognition.
HMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on-the-fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis.
We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate the proposed approach can effectively explore multimodal information for improved classification performance.
arXiv Detail & Related papers (2021-04-20T04:47:04Z)
- AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z)
- Modality Compensation Network: Cross-Modal Adaptation for Action Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
arXiv Detail & Related papers (2020-01-31T04:51:55Z)