Revisiting Pre-training in Audio-Visual Learning
- URL: http://arxiv.org/abs/2302.03533v1
- Date: Tue, 7 Feb 2023 15:34:14 GMT
- Title: Revisiting Pre-training in Audio-Visual Learning
- Authors: Ruoxuan Feng, Wenke Xia and Di Hu
- Abstract summary: We explore the effects of pre-trained models on two audio-visual learning scenarios.
We propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks.
- Score: 6.547660539954143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pre-training technique has achieved tremendous success in enhancing
model performance on various tasks, but it has been found to perform worse than
training from scratch in some uni-modal situations. This inspires us to ask: are
pre-trained models always effective in the more complex multi-modal scenario,
especially for heterogeneous modalities such as audio and visual? We find that the
answer is no. Specifically, we explore the effects of pre-trained models in two
audio-visual learning scenarios: cross-modal initialization and multi-modal joint
learning. When cross-modal initialization is applied, the "dead channel" phenomenon
caused by abnormal Batchnorm parameters hinders the utilization of model capacity.
We therefore propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit
the capacity of pre-trained models for target tasks. In multi-modal joint learning,
we find that a strong pre-trained uni-modal encoder can negatively affect the
encoder of the other modality. To alleviate this problem, we introduce a two-stage
Fusion Tuning strategy that takes better advantage of pre-trained knowledge while
making the uni-modal encoders cooperate through an adaptive masking method. The
experimental results show that our methods further exploit the potential of
pre-trained models and boost performance in audio-visual learning.
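The abstract does not spell out the ABRi procedure, but a minimal sketch of the underlying idea, assuming "dead channels" show up as BatchNorm scale parameters collapsed toward zero after cross-modal initialization, could look as follows (the threshold is a hypothetical hyper-parameter, not the paper's exact method):

```python
# Illustrative sketch, not the paper's exact ABRi procedure: re-initialize
# BatchNorm channels whose affine scale has collapsed toward zero, a plausible
# reading of the "dead channel" symptom described in the abstract.
import torch
import torch.nn as nn

@torch.no_grad()
def reinit_dead_bn_channels(model: nn.Module, eps: float = 1e-3) -> int:
    """Reset BatchNorm channels with near-zero scale to their default init."""
    n_reset = 0
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)) and m.affine:
            dead = m.weight.abs() < eps      # channels whose output is ~constant
            m.weight[dead] = 1.0             # default gamma
            m.bias[dead] = 0.0               # default beta
            if m.track_running_stats:
                m.running_mean[dead] = 0.0   # reset statistics as well
                m.running_var[dead] = 1.0
            n_reset += int(dead.sum())
    return n_reset

# Usage: after loading cross-modal (e.g., visual -> audio) pre-trained weights:
#   encoder.load_state_dict(pretrained_weights, strict=False)
#   print(reinit_dead_bn_channels(encoder), "channels re-initialized")
```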
Related papers
- Diagnosing and Re-learning for Balanced Multimodal Learning [8.779005254634857]
We propose the Diagnosing & Re-learning method to overcome the imbalanced multimodal learning problem.
The learning state of each modality is estimated based on the separability of its uni-modal representation space.
In this way, over-emphasis on scarcely informative modalities is avoided.
arXiv Detail & Related papers (2024-07-12T22:12:03Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
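As a rough illustration of the log-probability arithmetic behind EFT-style up-scaling, here is a sketch under the assumption that the three models share a vocabulary; the function name and decoding loop are illustrative, not the authors' implementation:

```python
# Hedged sketch of EFT up-scaling as summarized above: combine a large base
# model with the behavioral delta of a small fine-tuned model.
import torch

def eft_upscale_logits(logits_large_base: torch.Tensor,
                       logits_small_ft: torch.Tensor,
                       logits_small_base: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Next-token scores: large base + alpha * (small fine-tuned - small base)."""
    # Work in log-probability space so the three vocab distributions are comparable.
    lp_large = torch.log_softmax(logits_large_base, dim=-1)
    lp_ft = torch.log_softmax(logits_small_ft, dim=-1)
    lp_base = torch.log_softmax(logits_small_base, dim=-1)
    return lp_large + alpha * (lp_ft - lp_base)

# At each decoding step, run the same context through all three models and
# sample the next token from softmax(eft_upscale_logits(...)); adjusting
# alpha at test time is one way to trade off competing behavioral traits.
```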
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA)
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
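One common form such missing-modality augmentation can take is modality dropout. A minimal sketch, assuming pooled audio/visual features and a simple zero-filling strategy (both are illustrative choices, not necessarily UME-MMA's exact recipe):

```python
# Minimal modality-dropout sketch in the spirit of the UME-MMA description
# above: during training, randomly blank out one modality so the fusion head
# learns to cope with absent inputs. The drop probability is an assumption.
import torch

def modality_dropout(audio_feat: torch.Tensor,
                     visual_feat: torch.Tensor,
                     p_drop: float = 0.3):
    """Randomly zero out either the audio or the visual features of a batch."""
    if torch.rand(()) < p_drop:
        if torch.rand(()) < 0.5:
            audio_feat = torch.zeros_like(audio_feat)
        else:
            visual_feat = torch.zeros_like(visual_feat)
    return audio_feat, visual_feat
```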
arXiv Detail & Related papers (2023-10-10T07:47:57Z) - Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
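The summary does not detail MMLoRA's recipe; a minimal LoRA-style adapter sketch, assuming low-rank updates are attached to frozen linear layers of the pre-trained uni-modal encoders (rank, scaling, and placement are illustrative assumptions):

```python
# Minimal LoRA-style adapter sketch; the exact MMLoRA configuration is not
# given in the summary above, so this only illustrates the general mechanism.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # keep pre-trained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```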
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
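A schematic sketch of the recipe as summarized, assuming a frozen language model that accepts input embeddings, a single trainable linear projection, and one trainable soft token (interfaces and dimensions are assumptions, not the authors' implementation):

```python
# Schematic sketch of the eP-ALM recipe described above: freeze the language
# model, train one linear projection of perceptual features, and prepend one
# trainable soft token. The `lm` interface is assumed to accept embeddings.
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    def __init__(self, lm: nn.Module, vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():           # >99% of parameters stay frozen
            p.requires_grad_(False)
        self.proj = nn.Linear(vis_dim, lm_dim)   # the only trainable layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, vis_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, vis_dim) pooled perceptual features; text_emb: (B, T, lm_dim)
        prefix = self.proj(vis_feat).unsqueeze(1)                 # (B, 1, lm_dim)
        soft = self.soft_token.expand(text_emb.size(0), -1, -1)   # (B, 1, lm_dim)
        return self.lm(torch.cat([soft, prefix, text_emb], dim=1))
```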
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Towards Good Practices for Missing Modality Robust Action Recognition [20.26021126604409]
This paper seeks a set of good practices for multi-modal action recognition.
First, we study how to effectively regularize the model during training.
Second, we investigate fusion methods for robustness to missing modalities.
Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding.
arXiv Detail & Related papers (2022-11-25T06:10:57Z) - Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.