VideoAdviser: Video Knowledge Distillation for Multimodal Transfer
Learning
- URL: http://arxiv.org/abs/2309.15494v1
- Date: Wed, 27 Sep 2023 08:44:04 GMT
- Title: VideoAdviser: Video Knowledge Distillation for Multimodal Transfer
Learning
- Authors: Yanan Wang, Donghuo Zeng, Shinya Wada, Satoshi Kurihara
- Abstract summary: Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion.
We propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a specific modal fundamental model.
We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
- Score: 6.379202839994046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal transfer learning aims to transform pretrained representations of
diverse modalities into a common domain space for effective multimodal fusion.
However, conventional systems are typically built on the assumption that all modalities are available, and missing modalities generally lead to poor inference
performance. Furthermore, extracting pretrained embeddings for all modalities
is computationally inefficient for inference. In this work, to achieve high
efficiency-performance multimodal transfer learning, we propose VideoAdviser, a
video knowledge distillation method to transfer multimodal knowledge of
video-enhanced prompts from a multimodal fundamental model (teacher) to a
specific modal fundamental model (student). With the intuition that the best learning performance comes from professional advisers and smart students, we
use a CLIP-based teacher model to provide expressive multimodal knowledge
supervision signals to a RoBERTa-based student model via optimizing a
step-distillation objective loss: in the first step, the teacher distills multimodal knowledge of video-enhanced prompts from classification logits to a regression logit; in the second step, the multimodal knowledge is distilled from the teacher's regression logit to the student. We evaluate our method on two challenging
multimodal tasks: video-level sentiment analysis (MOSI and MOSEI datasets) and
audio-visual retrieval (VEGAS dataset). The student (requiring only the text
modality as input) achieves an MAE score improvement of up to 12.3% for MOSI
and MOSEI. Our method further improves on the state-of-the-art method by a 3.4% mAP score on VEGAS without additional computation at inference. These results
suggest the strengths of our method for achieving high efficiency-performance
multimodal transfer learning.
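To make the step-distillation objective concrete, the following is a minimal PyTorch-style sketch of how the two steps described in the abstract could be combined into one loss. The specific loss choices (MSE for both distillation steps, L1 for task supervision), the use of per-class scalar values to turn the teacher's classification distribution into a regression target, and the weighting coefficients are illustrative assumptions, not the authors' exact implementation; in this sketch the student's regression logit comes from text alone, consistent with the claim that only the text modality is required at inference.

```python
# Minimal sketch of a two-step ("step-distillation") objective, assuming the
# CLIP-based teacher exposes both classification logits and a regression logit,
# and the RoBERTa-based student outputs a single regression logit from text.
import torch
import torch.nn.functional as F

def step_distillation_loss(teacher_cls_logits,  # (B, C) teacher classification logits
                           teacher_reg_logit,   # (B, 1) teacher regression logit
                           student_reg_logit,   # (B, 1) student regression logit (text-only input)
                           class_values,        # (C,)  scalar value per class, e.g. sentiment bins (assumed)
                           labels,              # (B, 1) ground-truth regression targets
                           alpha=1.0, beta=1.0, gamma=1.0):  # weights are illustrative
    # Step 1 (within the teacher): align the teacher's regression logit with the
    # expectation of its classification distribution over the class values.
    soft_targets = F.softmax(teacher_cls_logits, dim=-1)                 # (B, C)
    expected_value = (soft_targets * class_values).sum(-1, keepdim=True) # (B, 1)
    step1 = F.mse_loss(teacher_reg_logit, expected_value)

    # Step 2 (teacher -> student): distill the teacher's regression logit
    # into the student's regression output.
    step2 = F.mse_loss(student_reg_logit, teacher_reg_logit.detach())

    # Task supervision on the student (assumed; MOSI/MOSEI use regression labels).
    task = F.l1_loss(student_reg_logit, labels)

    return alpha * step1 + beta * step2 + gamma * task
```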
Related papers
- Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation [30.33381342502258]
A key challenge is unimodal bias, where multimodal segmentors over-rely on certain modalities, causing performance drops when others are missing.
We develop the first framework for learning a robust segmentor that can handle any combination of visual modalities.
arXiv Detail & Related papers (2024-11-26T06:15:27Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning [12.00246872965739]
We propose a novel dynamic self-adaptive multiscale distillation framework from a pre-trained multimodal large model.
Our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model across multiple scales.
Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information.
arXiv Detail & Related papers (2024-04-16T18:22:49Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models [17.25135606956287]
The Competitive Multi-modal Distillation framework (CoMD) captures bidirectional feedback between the teacher and student models.
Our experimental analysis of diverse datasets shows that our knowledge transfer method consistently improves the capabilities of the student model.
arXiv Detail & Related papers (2023-11-14T14:49:46Z)
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Improving Multi-Modal Learning with Uni-Modal Teachers [14.917618203952479]
We propose a new multi-modal learning method, Uni-Modal Teacher, which combines the fusion objective and uni-modal distillation to tackle the modality failure problem.
We show that our method not only drastically improves the representation of each modality, but also improves the overall multi-modal task performance.
arXiv Detail & Related papers (2021-06-21T12:46:47Z)
- AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition [61.51188561808917]
We propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition.
We show that our proposed approach yields a 35%-55% reduction in computation compared to the traditional baseline.
arXiv Detail & Related papers (2021-05-11T16:19:07Z)
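As a rough illustration of the per-segment modality selection that AdaMML describes, below is a minimal sketch of a gating policy using a straight-through Gumbel-Softmax. The module name `ModalitySelector`, the feature shapes, and the single linear policy head are assumptions made for illustration, not AdaMML's actual implementation.

```python
# Sketch of on-the-fly, per-segment modality selection: a lightweight policy
# emits a keep/skip decision for each modality of each video segment, so that
# skipped modalities need not be processed by the heavy backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySelector(nn.Module):
    def __init__(self, feat_dim, num_modalities):
        super().__init__()
        self.num_modalities = num_modalities
        # Policy head: two logits (keep, skip) per modality, per segment.
        self.policy = nn.Linear(feat_dim, num_modalities * 2)

    def forward(self, coarse_feat):
        # coarse_feat: (B, T, D) cheap per-segment features used only for the decision.
        B, T, _ = coarse_feat.shape
        logits = self.policy(coarse_feat).view(B, T, self.num_modalities, 2)
        # Straight-through Gumbel-Softmax gives discrete keep/skip decisions
        # while remaining differentiable during training.
        decisions = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]  # (B, T, M)
        return decisions

if __name__ == "__main__":
    selector = ModalitySelector(feat_dim=128, num_modalities=3)
    coarse = torch.randn(2, 8, 128)   # 2 clips, 8 segments, 128-d coarse features
    mask = selector(coarse)           # (2, 8, 3) binary keep/skip mask per modality
    print(mask.shape, mask.sum().item())
```

In practice, such a mask would multiply (or skip) each modality's per-segment features before fusion, which is where the reported computation savings would come from.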