Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
- URL: http://arxiv.org/abs/2311.05152v2
- Date: Wed, 20 Dec 2023 23:06:09 GMT
- Title: Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
- Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
- Abstract summary: This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
- Score: 55.36987468073152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the deployment of large-scale pre-trained models in
audio-visual downstream tasks has yielded remarkable outcomes. However, these
models, primarily trained on single-modality unconstrained datasets, still
encounter challenges in feature extraction for multi-modal tasks, leading to
suboptimal performance. This limitation arises due to the introduction of
irrelevant modality-specific information during encoding, which adversely
affects the performance of downstream tasks. To address this challenge, this
paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention
mechanism. This mechanism leverages audio and visual modalities as soft prompts
to dynamically adjust the parameters of pre-trained models based on the current
multi-modal input features. Specifically, the DG-SCT module incorporates
trainable cross-modal interaction layers into pre-trained audio-visual
encoders, allowing adaptive extraction of crucial information from the current
modality across spatial, channel, and temporal dimensions, while preserving the
frozen parameters of large-scale pre-trained models. Experimental evaluations
demonstrate that our proposed model achieves state-of-the-art results across
multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our
model exhibits promising performance in challenging few-shot and zero-shot
scenarios. The source code and pre-trained models are available at
https://github.com/haoyi-duan/DG-SCT.
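To make the mechanism concrete, the sketch below shows an audio-guided spatial-channel-temporal adapter in PyTorch. All names, shapes, and the single audio-to-visual direction are assumptions for exposition, not the authors' implementation; see the repository above for the real code.

```python
import torch
import torch.nn as nn

class AudioGuidedSCT(nn.Module):
    """Hypothetical audio-guided spatial-channel-temporal attention adapter.

    Visual features x: (B, T, C, H, W); audio features a: (B, T, D).
    Only this module would be trained; the surrounding encoder stays frozen.
    """

    def __init__(self, c_visual: int, d_audio: int, reduction: int = 4):
        super().__init__()
        self.channel_gate = nn.Sequential(          # audio -> per-channel weights
            nn.Linear(d_audio, c_visual // reduction), nn.ReLU(),
            nn.Linear(c_visual // reduction, c_visual), nn.Sigmoid())
        self.spatial_query = nn.Linear(d_audio, c_visual)
        self.temporal_gate = nn.Sequential(nn.Linear(d_audio, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = x.shape
        # Channel attention: the audio prompt decides which visual channels matter.
        x = x * self.channel_gate(a).view(B, T, C, 1, 1)
        # Spatial attention: dot product between an audio query and every location.
        q = self.spatial_query(a).view(B, T, C, 1)          # (B, T, C, 1)
        k = x.flatten(3)                                    # (B, T, C, H*W)
        attn = torch.softmax((q * k).sum(2, keepdim=True) / C ** 0.5, dim=-1)
        x = x * attn.view(B, T, 1, H, W)
        # Temporal attention: per-frame importance derived from the audio stream.
        return x * self.temporal_gate(a).view(B, T, 1, 1, 1)
```

A dual-guided version would add the symmetric visual-to-audio path; in practice such adapters are interleaved with the frozen encoder blocks, typically with residual connections so pre-trained features are modulated rather than replaced.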
Related papers
- Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning [6.1339395157466425]
Foundational deep learning (DL) models are general models, trained on diverse and unlabelled datasets.
We introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals.
arXiv Detail & Related papers (2024-11-14T23:56:57Z)
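As a rough picture of the masked-spectrogram objective described above, here is a minimal MAE-style training step; the patching, 75% mask ratio, and encoder/decoder APIs are assumptions, not the paper's design.

```python
import torch

def masked_spectrogram_step(encoder, decoder, spec, mask_ratio: float = 0.75):
    """One illustrative pretraining step on spectrogram patches.

    spec: (B, N, D) -- N flattened patches per spectrogram.
    encoder: maps visible patches (B, n_keep, D) to latents.
    decoder: maps latents back to all N patches (B, N, D) -- hypothetical API.
    """
    B, N, D = spec.shape
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N, device=spec.device).argsort(dim=1)
    keep = perm[:, :n_keep]                              # indices of visible patches
    visible = torch.gather(spec, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    recon = decoder(encoder(visible))                    # reconstruct every patch
    masked = torch.ones(B, N, dtype=torch.bool, device=spec.device)
    masked.scatter_(1, keep, False)                      # True where patch was hidden
    return ((recon - spec) ** 2)[masked].mean()          # loss only on hidden patches
```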
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
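The prompt-tuning stage can be pictured with a generic sketch: a handful of trainable prompt vectors prepended to a frozen sequence encoder. The paper's Customized Prompt Learning module is more elaborate; everything below is a hypothetical stand-in.

```python
import torch
import torch.nn as nn

class PromptTunedSeqModel(nn.Module):
    """Frozen sequence encoder with trainable prompt vectors prepended to the
    item-embedding sequence; names and shapes are illustrative."""

    def __init__(self, backbone: nn.Module, d_model: int, n_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                      # pre-trained weights frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, item_emb: torch.Tensor) -> torch.Tensor:
        # item_emb: (B, L, d_model); prepend the shared prompts to every sequence.
        p = self.prompts.unsqueeze(0).expand(item_emb.size(0), -1, -1)
        return self.backbone(torch.cat([p, item_emb], dim=1))
```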
- Scalable Transformer for High Dimensional Multivariate Time Series Forecasting [10.17270031004674]
This study investigates the reasons behind the suboptimal performance of channel-dependent models on high-dimensional MTS data.
We propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting.
Experiments show STHD's considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic.
arXiv Detail & Related papers (2024-08-08T06:17:13Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancy, because pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
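A bare-bones reading of multi-task supervised pretraining: one shared backbone, one head and one criterion per task, with the losses summed. Task names and head types below are placeholders, not the SAMRS setup.

```python
import torch.nn as nn

class MultiTaskPretrainer(nn.Module):
    """Shared backbone, one head and one criterion per task, losses summed."""

    def __init__(self, backbone: nn.Module, heads: dict):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(heads)    # e.g. {"semseg": ..., "detection": ...}

    def forward(self, images, targets: dict, criteria: dict):
        feats = self.backbone(images)        # features shared by all tasks
        total = 0.0
        for task, head in self.heads.items():
            total = total + criteria[task](head(feats), targets[task])
        return total
```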
- Timer: Generative Pre-trained Transformers Are Large Time Series Models [83.03091523806668]
This paper aims at the early development of large time series models (LTSM).
During pre-training, we curate large-scale datasets with up to 1 billion time points.
To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task.
arXiv Detail & Related papers (2024-02-04T06:55:55Z)
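One way to read "a unified generative task": cut each series into fixed-length segments and train next-segment prediction, so forecasting (and, with suitable masking, imputation and anomaly detection) reduces to generation. The segment length below is an arbitrary assumption.

```python
import torch

def series_to_segments(x: torch.Tensor, patch: int = 24):
    """Cut a batch of univariate series (B, L) into next-segment prediction
    pairs, so a decoder-only model can be trained generatively."""
    B, L = x.shape
    n = L // patch
    segments = x[:, : n * patch].reshape(B, n, patch)
    return segments[:, :-1], segments[:, 1:]   # (inputs, shifted targets)
```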
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
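The summary above is concrete enough to sketch: freeze both backbones, then train only a linear projection and a single soft token. The exact wiring below is a guess at the spirit of eP-ALM, not its actual code.

```python
import torch
import torch.nn as nn

class PerceptualLM(nn.Module):
    """Freeze vision encoder and LM; train one linear projection and one
    soft token prepended to the input sequence. Wiring is an assumption."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 d_vis: int, d_lm: int):
        super().__init__()
        self.vision, self.lm = vision_encoder, language_model
        for p in list(self.vision.parameters()) + list(self.lm.parameters()):
            p.requires_grad = False                     # >99% of parameters frozen
        self.proj = nn.Linear(d_vis, d_lm)              # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, d_lm))

    def forward(self, image, text_emb):
        # vision(image) assumed to return a pooled (B, d_vis) feature.
        v = self.proj(self.vision(image)).unsqueeze(1)  # (B, 1, d_lm)
        tok = self.soft_token.expand(text_emb.size(0), -1, -1)
        return self.lm(torch.cat([tok, v, text_emb], dim=1))
```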
- Revisiting Pre-training in Audio-Visual Learning [6.547660539954143]
We explore the effects of pre-trained models on two audio-visual learning scenarios.
We propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks.
arXiv Detail & Related papers (2023-02-07T15:34:14Z)
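The summary gives little beyond the name, so the helper below is only a guess at what re-initializing BatchNorm layers could look like in PyTorch; the adaptive scheduling implied by ABRi is not modeled.

```python
import torch.nn as nn

def reinit_batchnorm(model: nn.Module) -> None:
    """Re-initialize every BatchNorm layer of a pre-trained network before
    fine-tuning; the *adaptive* part of the paper's ABRi is not modeled."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()    # zero running mean, unit running variance
            m.reset_parameters()       # fresh affine weight and bias
```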
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
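The tensor-train machinery itself is involved; as a deliberately simplified stand-in, the block below merely merges convolutional features from the last few frames with one convolution, showing the role (not the method) of the paper's temporal module.

```python
import torch
import torch.nn as nn

class TemporalFeatureMixer(nn.Module):
    """Deliberately simplified stand-in: merge convolutional features from the
    last K frames with a single convolution. It plays the *role* of the
    paper's tensor-train module (combining features across time) without the
    tensor-train factorization itself."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.mix = nn.Conv2d(k * channels, channels, kernel_size=3, padding=1)

    def forward(self, frames):
        # frames: list of K feature maps, each (B, C, H, W).
        return self.mix(torch.cat(frames, dim=1))
```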
This list is automatically generated from the titles and abstracts of the papers on this site.