Cross-Modal Adapter for Text-Video Retrieval
- URL: http://arxiv.org/abs/2211.09623v1
- Date: Thu, 17 Nov 2022 16:15:30 GMT
- Title: Cross-Modal Adapter for Text-Video Retrieval
- Authors: Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen
Lu, Jie Zhou, Shiji Song, Gao Huang
- Abstract summary: We present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adapt the pre-trained model with a small number of parameterized layers.
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
- Score: 91.9575196703281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-video retrieval is an important multi-modal learning task, where the
goal is to retrieve the most relevant video for a given text query. Recently,
pre-trained models, e.g., CLIP, show great potential on this task. However, as
pre-trained models are scaling up, fully fine-tuning them on text-video
retrieval datasets has a high risk of overfitting. Moreover, in practice, it
would be costly to train and store a large model for each task. To overcome the
above issues, we present a novel $\textbf{Cross-Modal Adapter}$ for
parameter-efficient fine-tuning. Inspired by adapter-based methods, we adapt
the pre-trained model with a small number of parameterized layers. However, there are
two notable differences. First, our method is designed for the multi-modal
domain. Second, it allows early cross-modal interactions between CLIP's two
encoders. Although surprisingly simple, our approach has three notable
benefits: (1) reduces $\textbf{99.6}\%$ of fine-tuned parameters, and
alleviates the problem of overfitting, (2) saves approximately 30% of training
time, and (3) allows all the pre-trained parameters to be fixed, enabling the
pre-trained model to be shared across datasets. Extensive experiments
demonstrate that, without bells and whistles, it achieves superior or
comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD,
VATEX, ActivityNet, and DiDeMo datasets. The code will be available at
\url{https://github.com/LeapLabTHU/Cross-Modal-Adapter}.
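Since the abstract describes the adapter idea only at a high level, the following is a minimal, hypothetical PyTorch sketch of that principle: a small bottleneck module with a residual connection, plus a cross-modal variant that shares its down-projection between the text and video branches so CLIP's two encoders can interact early. The module names, bottleneck width, and weight-sharing scheme are illustrative assumptions, not the paper's actual architecture; in use, the CLIP backbone would stay frozen and only these adapter weights would be trained.
```python
# Minimal sketch (not the paper's architecture): a bottleneck adapter and a
# hypothetical cross-modal variant with a shared down-projection. All names,
# dimensions, and insertion choices here are illustrative assumptions.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Standard adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as a (near-)identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class CrossModalAdapter(nn.Module):
    """Cross-modal variant: text and video features pass through a shared
    bottleneck projection, letting the two branches interact before the
    final text-video similarity is computed."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.shared_down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.text_up = nn.Linear(bottleneck, dim)
        self.video_up = nn.Linear(bottleneck, dim)
        for proj in (self.text_up, self.video_up):
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        t = text + self.text_up(self.act(self.shared_down(text)))
        v = video + self.video_up(self.act(self.shared_down(video)))
        return t, v


if __name__ == "__main__":
    dim = 512  # typical CLIP embedding width
    adapter = CrossModalAdapter(dim)
    text = torch.randn(8, dim)   # batch of text embeddings from a frozen encoder
    video = torch.randn(8, dim)  # batch of video embeddings from a frozen encoder
    t, v = adapter(text, video)
    n_trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
    print(f"trainable adapter parameters: {n_trainable}")  # orders of magnitude
    # fewer than a full CLIP backbone
```
Because only the adapter parameters are optimized, the same frozen CLIP checkpoint can be shared across datasets, which is the storage benefit the abstract highlights.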
Related papers
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts [14.610244867640471]
Recent vision-language models are driven by large-scale pretrained models.
We introduce a parameter-efficient method to address challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language.
Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency.
arXiv Detail & Related papers (2023-09-27T18:00:09Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval [60.454321238910474]
State-of-the-art video-text retrieval methods typically involve fully fine-tuning a pre-trained model on specific datasets.
We present our pioneering work that enables parameter-efficient VTR using a pre-trained model.
We propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text.
arXiv Detail & Related papers (2023-01-19T03:42:56Z)
- HADA: A Graph-based Amalgamation Framework in Image-text Retrieval [2.3013879633693266]
We propose a compact graph-based framework, named HADA, which can combine pretrained models to produce a better result.
Our experiments showed that HADA could increase baseline performance by more than 3.6% in terms of evaluation metrics on the Flickr30k dataset.
arXiv Detail & Related papers (2023-01-11T22:25:20Z)
- Multi-Head Adapter Routing for Cross-Task Generalization [56.75667096355806]
Polytropon learns an inventory of adapters and a routing function that selects a subset of adapters for each task during both pre-training and few-shot adaptation.
We find that routing is most beneficial during multi-task pre-training rather than during few-shot adaptation.
arXiv Detail & Related papers (2022-11-07T19:35:55Z)
- Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-arts on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)