Cross-Modal Adapter for Vision-Language Retrieval
- URL: http://arxiv.org/abs/2211.09623v2
- Date: Sat, 30 Aug 2025 16:28:29 GMT
- Title: Cross-Modal Adapter for Vision-Language Retrieval
- Authors: Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang
- Abstract summary: We present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. Our approach has three notable benefits: (1) reduces the vast majority of fine-tuned parameters, (2) saves training time, and (3) allows all the pre-trained parameters to be fixed.
- Score: 60.59577149733934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on retrieval tasks. However, as pre-trained models are scaling up, fully fine-tuning them on downstream retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows encoder-level implicit cross-modal interactions between vision and language encoders. Although surprisingly simple, our approach has three notable benefits: (1) reduces the vast majority of fine-tuned parameters, (2) saves training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach outperforms adapter-based methods on image-text retrieval datasets (MSCOCO, Flickr30K) and video-text retrieval datasets (MSR-VTT, DiDeMo, and ActivityNet).
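To make the idea concrete, here is a minimal PyTorch sketch of one plausible reading of such an adapter: a bottleneck module whose core projection is shared between the vision and text branches, so both modalities are fine-tuned through the same low-rank subspace while the CLIP backbone stays frozen. All names, dimensions, and the exact weight-sharing scheme are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn

class CrossModalAdapterSketch(nn.Module):
    """Hypothetical bottleneck adapter with a core layer shared between
    the vision and text encoders, so gradients from both modalities
    shape the same low-rank subspace (one plausible reading of
    'encoder-level implicit cross-modal interaction')."""

    def __init__(self, vis_dim=768, txt_dim=512, bottleneck=64):
        super().__init__()
        self.vis_down = nn.Linear(vis_dim, bottleneck)   # modality-specific down-projection
        self.txt_down = nn.Linear(txt_dim, bottleneck)
        self.shared = nn.Linear(bottleneck, bottleneck)  # weight-shared across modalities
        self.vis_up = nn.Linear(bottleneck, vis_dim)
        self.txt_up = nn.Linear(bottleneck, txt_dim)
        self.act = nn.GELU()

    def forward(self, x, modality: str):
        down = self.vis_down if modality == "vision" else self.txt_down
        up = self.vis_up if modality == "vision" else self.txt_up
        h = self.act(self.shared(self.act(down(x))))
        return x + up(h)  # residual: frozen backbone features pass through unchanged

def freeze_backbone(model: nn.Module):
    """Fix every pre-trained parameter; only adapter weights receive gradients."""
    for p in model.parameters():
        p.requires_grad = False
```
Only the adapter parameters would be optimized; freezing everything else is what lets a single pre-trained backbone be stored once and shared across datasets.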
Related papers
- CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets [0.9642500063568188]
This paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T.
We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning.
With only 12.8% trainable parameters relative to the backbone for processing video input, we achieve results comparable to, and even better than, the state of the art.
arXiv Detail & Related papers (2025-01-06T19:01:10Z)
- Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models [38.751158173278796]
This work introduces a cross-modal parameter-efficient approach named XMAdapter.
XMAdapter establishes cache models for both text and image modalities.
It then leverages retrieval through visual-language bimodal information to gather clues for inference.
arXiv Detail & Related papers (2024-04-19T02:33:23Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure (a minimal sketch follows below).
It achieves state-of-the-art performance, comparable to that of some heavily pretrained video foundation models.
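As a rough illustration of the dropping step, the sketch below keeps a random subset of patch tokens per clip; the function name and keep ratio are assumptions, and the text-masking half would follow the usual BERT-style recipe rather than being shown here.
```python
import torch

def drop_video_patches(patches: torch.Tensor, keep_ratio: float = 0.3):
    """Randomly keep a subset of patch tokens per sample.

    patches: (batch, num_patches, dim) patch embeddings.
    Returns the kept tokens and their indices. A hypothetical helper,
    not the paper's code; keep_ratio is an assumed hyperparameter.
    """
    b, n, d = patches.shape
    n_keep = max(1, int(n * keep_ratio))
    # Independent random permutation per sample; keep the first n_keep indices.
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, idx
```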
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts [14.610244867640471]
Recent vision-language models are driven by large-scale pretrained models.
We introduce a parameter-efficient method to address challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language.
Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency.
arXiv Detail & Related papers (2023-09-27T18:00:09Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We direct effort toward efficient adaptation of existing models and propose augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
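Read literally, that recipe is only a few lines of PyTorch. The sketch below assumes a HuggingFace-style language model that accepts precomputed embeddings via `inputs_embeds`; all class and argument names are illustrative, not the released eP-ALM code.
```python
import torch
import torch.nn as nn

class PerceptualPrefixSketch(nn.Module):
    """Sketch of the recipe described above: a frozen language model
    receives one trainable soft token plus visual features passed
    through a single trainable linear projection."""

    def __init__(self, lm: nn.Module, vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():           # freeze >99% of all parameters
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, lm_dim)   # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, vis_feats, text_embeds):
        # vis_feats: (B, D_vis); text_embeds: (B, T, D_lm)
        b = text_embeds.size(0)
        prefix = torch.cat([self.soft_token.expand(b, -1, -1),
                            self.proj(vis_feats).unsqueeze(1)], dim=1)
        # Assumes a HuggingFace-style model that accepts `inputs_embeds`.
        return self.lm(inputs_embeds=torch.cat([prefix, text_embeds], dim=1))
```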
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling [49.134517040512414]
This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on vision-language models.
Experiments show that UniAdapter not only outperforms the state of the art but even beats the full fine-tuning strategy.
arXiv Detail & Related papers (2023-02-13T18:59:10Z)
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval [60.454321238910474]
State-of-the-art video-text retrieval methods typically involve fully fine-tuning a pre-trained model on specific datasets.
We present our pioneering work that enables parameter-efficient VTR using a pre-trained model.
We propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text.
arXiv Detail & Related papers (2023-01-19T03:42:56Z)
- HADA: A Graph-based Amalgamation Framework in Image-text Retrieval [2.3013879633693266]
We propose a compact graph-based framework, named HADA, which can combine pretrained models to produce a better result.
Our experiments show that HADA increases baseline performance by more than 3.6% on the Flickr30k dataset.
arXiv Detail & Related papers (2023-01-11T22:25:20Z)
- Multi-Head Adapter Routing for Cross-Task Generalization [56.75667096355806]
Polytropon learns an inventory of adapters and a routing function that selects a subset of adapters for each task during both pre-training and few-shot adaptation.
We find that routing is most beneficial during multi-task pre-training rather than during few-shot adaptation.
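A toy version of the inventory-plus-routing idea summarized above might look like the following, with a learned per-task gate over a shared pool of bottleneck adapters; this is a simplification for illustration, not Polytropon's exact routing function.
```python
import torch
import torch.nn as nn

class AdapterInventory(nn.Module):
    """Shared pool of adapters plus a per-task routing table: each task
    owns a learned logit vector, and the layer mixes adapter outputs by
    the resulting sigmoid gates (soft subset selection)."""

    def __init__(self, dim: int, n_adapters: int, n_tasks: int, bottleneck: int = 16):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_adapters))
        self.routing = nn.Parameter(torch.zeros(n_tasks, n_adapters))

    def forward(self, x: torch.Tensor, task_id: int):
        gates = torch.sigmoid(self.routing[task_id])       # soft subset selection
        mix = sum(g * a(x) for g, a in zip(gates, self.adapters))
        return x + mix                                     # residual adapter output
```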
arXiv Detail & Related papers (2022-11-07T19:35:55Z)
- Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
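The objective in that summary can be sketched in a few lines: image patches and text tokens share one sequence, a random subset is replaced by a mask embedding, and a single encoder predicts the originals at masked positions. The mask ratio, the regression loss, and the `encoder`/`head` interfaces (any modules mapping (B, N, D) to (B, N, D)) are assumptions for clarity, not M3AE's released code.
```python
import torch

def m3ae_step(img_tokens, txt_tokens, encoder, mask_token, head, mask_ratio=0.5):
    """One masked-prediction step over a joint image+text sequence.
    mask_token is a learnable (1, 1, D) embedding broadcast over the batch."""
    x = torch.cat([img_tokens, txt_tokens], dim=1)          # (B, N, D) joint sequence
    b, n, d = x.shape
    mask = torch.rand(b, n, device=x.device) < mask_ratio   # True = masked position
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(b, n, d), x)
    pred = head(encoder(corrupted))                         # encode once, predict everywhere
    return ((pred - x) ** 2)[mask].mean()                   # loss only at masked positions
```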
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.