M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- URL: http://arxiv.org/abs/2401.17797v1
- Date: Wed, 31 Jan 2024 12:45:44 GMT
- Title: M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Authors: Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang,
Qingpei Guo
- Abstract summary: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
- Score: 13.418762442122723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training
towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
Built upon popular image-text models like CLIP, most current adaptation-based
video-text pre-training methods are confronted with three major issues, i.e.,
a noisy data corpus, time-consuming pre-training, and limited performance gain.
Towards this end, we conduct a comprehensive study including four critical
steps in video-text pre-training. Specifically, we investigate 1) data
filtering and refinement, 2) video input type selection, 3) temporal modeling,
and 4) video feature enhancement. We then summarize this empirical study into
the M2-RAAP recipe, where our technical contributions lie in 1) the data
filtering and text re-writing pipeline resulting in 1M high-quality bilingual
video-text pairs, 2) the replacement of video inputs with key-frames to
accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to
enhance video features. We conduct extensive experiments by adapting three
image-text foundation models on two refined video-text datasets from different
languages, validating the robustness and reproducibility of M2-RAAP for
adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior
performance with significantly reduced data (-90%) and time consumption (-95%),
establishing a new SOTA on four English zero-shot retrieval datasets and two
Chinese ones. We are preparing our refined bilingual data annotations and
codebase, which will be available at
https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.
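As a rough companion to the recipe's second and third contributions (key-frame inputs and the Auxiliary-Caption-Guided strategy), the following PyTorch sketch encodes a handful of key-frames with a frozen CLIP-style image encoder, mean-pools them into a video embedding, and blends in an auxiliary-caption embedding before computing retrieval scores. The encoder stubs, the number of key-frames, and the fusion weight alpha are assumptions for illustration, not the M2-RAAP implementation.

# Illustrative sketch: key-frame adaptation plus ACG-style feature enhancement.
# StubEncoder, K=8 key-frames, and alpha are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StubEncoder(nn.Module):
    """Placeholder for a frozen CLIP-style image or text encoder head."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def encode_video_from_keyframes(image_encoder, keyframes):
    # keyframes: (B, K, D) pre-extracted key-frame features; mean-pool the
    # per-frame embeddings into a single video embedding.
    b, k, d = keyframes.shape
    frame_emb = image_encoder(keyframes.reshape(b * k, d)).reshape(b, k, -1)
    return F.normalize(frame_emb.mean(dim=1), dim=-1)

def acg_enhance(video_emb, aux_caption_emb, alpha=0.5):
    # Blend the video embedding with an auxiliary-caption embedding; a simple
    # convex combination stands in for the paper's ACG strategy.
    return F.normalize(alpha * video_emb + (1 - alpha) * aux_caption_emb, dim=-1)

# Usage: similarity between enhanced video embeddings and query-text embeddings.
image_encoder = StubEncoder(in_dim=768)
text_encoder = StubEncoder(in_dim=768)
keyframes = torch.randn(4, 8, 768)     # 4 videos, 8 key-frames each
aux_captions = torch.randn(4, 768)     # auxiliary-caption features
queries = torch.randn(16, 768)         # query-text features

video_emb = encode_video_from_keyframes(image_encoder, keyframes)
video_emb = acg_enhance(video_emb, text_encoder(aux_captions))
scores = video_emb @ text_encoder(queries).t()   # (4, 16) retrieval scores

Mean pooling and a convex-combination fusion are the simplest possible choices here; the paper's study of temporal modeling and video feature enhancement is precisely about exploring better options in this design space.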
Related papers
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To bridge the gap between these frozen models, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP.
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be used directly to yield the desired answer.
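A minimal sketch of such a retrieve-then-answer pipeline (illustrative only; embed_video, embed_texts, and llm_answer are hypothetical stand-ins for a pre-trained multi-modal encoder and a frozen LLM, not the R2A code):

# Rough sketch of a retrieve-then-answer VideoQA pipeline in the spirit of R2A.
from typing import Callable, List
import numpy as np

def retrieve_to_answer(video: np.ndarray,
                       question: str,
                       corpus: List[str],
                       embed_video: Callable[[np.ndarray], np.ndarray],
                       embed_texts: Callable[[List[str]], np.ndarray],
                       llm_answer: Callable[[str], str],
                       top_k: int = 5) -> str:
    # 1) Embed the video and the corpus texts in a shared space.
    v = embed_video(video)                              # shape (d,)
    t = embed_texts(corpus)                             # shape (n, d)
    v = v / np.linalg.norm(v)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)

    # 2) Retrieve the top-k semantically similar texts by cosine similarity.
    top_idx = np.argsort(-(t @ v))[:top_k]
    context = "\n".join(corpus[i] for i in top_idx)

    # 3) A frozen LLM answers directly from the question plus retrieved texts.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)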
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos with text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose those that provide the best prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module named BridgeFormer is trained to answer the "questions" constructed from the text features by resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
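A minimal sketch of an MCQ-style pretext task under assumed dimensions and modules (not the BridgeFormer architecture): question tokens, i.e., caption tokens with a phrase erased, attend over video tokens, and the pooled answer embedding is contrasted in-batch against the embedding of the erased phrase.

# Illustrative MCQ-style pretext task; dims and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAnswerBridge(nn.Module):
    """Cross-attention module: question tokens query video tokens."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, question_tokens, video_tokens):
        attended, _ = self.cross_attn(question_tokens, video_tokens, video_tokens)
        return F.normalize(attended.mean(dim=1), dim=-1)   # one answer embedding

bridge = QuestionAnswerBridge()
question_tokens = torch.randn(4, 12, 512)  # caption tokens with a phrase erased
video_tokens = torch.randn(4, 32, 512)     # per-frame / per-patch video features
phrase_emb = F.normalize(torch.randn(4, 512), dim=-1)  # erased-phrase embeddings

answers = bridge(question_tokens, video_tokens)          # (4, 512)
logits = answers @ phrase_emb.t() / 0.07                 # in-batch contrast
loss = F.cross_entropy(logits, torch.arange(4))          # match answer i to phrase i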
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, enforcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.