M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- URL: http://arxiv.org/abs/2401.17797v1
- Date: Wed, 31 Jan 2024 12:45:44 GMT
- Title: M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Authors: Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang,
Qingpei Guo
- Abstract summary: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
- Score: 13.418762442122723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training
towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
Built upon popular image-text models such as CLIP, most current adaptation-based
video-text pre-training methods are confronted with three major issues, i.e., a
noisy data corpus, time-consuming pre-training, and limited performance gain.
To this end, we conduct a comprehensive study covering four critical
steps in video-text pre-training. Specifically, we investigate 1) data
filtering and refinement, 2) video input type selection, 3) temporal modeling,
and 4) video feature enhancement. We then summarize this empirical study into
the M2-RAAP recipe, where our technical contributions lie in 1) the data
filtering and text re-writing pipeline resulting in 1M high-quality bilingual
video-text pairs, 2) the replacement of video inputs with key-frames to
accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to
enhance video features. We conduct extensive experiments by adapting three
image-text foundation models on two refined video-text datasets from different
languages, validating the robustness and reproducibility of M2-RAAP for
adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior
performance with significantly reduced data (-90%) and time consumption (-95%),
establishing a new SOTA on four English zero-shot retrieval datasets and two
Chinese ones. We are preparing our refined bilingual data annotations and
codebase, which will be available at
https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.
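
To make the key-frame idea in contribution 2) concrete, below is a minimal sketch (not the authors' code) of adapting a frozen image-text backbone to zero-shot video-text retrieval by encoding a handful of sampled key-frames and mean-pooling them into a single video embedding. The open_clip backbone, uniform frame sampling, and mean-pooling are illustrative assumptions; M2-RAAP's data filtering, text re-writing, and ACG feature enhancement are not reproduced here.

```python
# Minimal sketch (not the authors' code): zero-shot video-text retrieval with a
# frozen image-text model applied to a few key-frames instead of the full video.
# Assumptions: open_clip as a stand-in backbone; uniform sampling as a cheap
# proxy for key-frame selection.
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def sample_key_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames as a simple surrogate for key-frame extraction."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = torch.linspace(0, max(total - 1, 0), num_frames).long().tolist()
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

@torch.no_grad()
def video_text_similarity(video_path: str, captions: list[str]) -> torch.Tensor:
    frames = sample_key_frames(video_path)
    pixel = torch.stack([preprocess(f) for f in frames])   # (T, 3, H, W)
    frame_emb = model.encode_image(pixel)                   # (T, D) per-frame embeddings
    video_emb = frame_emb.mean(dim=0, keepdim=True)         # mean-pool frames into one video embedding
    text_emb = model.encode_text(tokenizer(captions))       # (N, D) caption embeddings
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return video_emb @ text_emb.T                           # (1, N) cosine similarities

# Usage: rank candidate captions for one video.
# sims = video_text_similarity("clip.mp4", ["a dog catches a frisbee", "a man cooks pasta"])
# print(sims.argmax(dim=-1))
```

Encoding only a few key-frames lets the image-text encoder be reused with little modification, which is the intuition behind the pre-training acceleration claimed in the abstract.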
Related papers
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost long-video understanding performance without training on long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
We propose using learnable tokens as a communication medium among these four frozen models (GPT-2, XCLIP, CLIP, and AnglE).
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield the desired answer.
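
A hedged sketch of this retrieve-then-answer idea (not the paper's implementation): frame embeddings from a frozen image-text model retrieve the nearest captions from a generic text corpus, and the retrieved texts plus the question are packed into a prompt for a frozen language model. The open_clip and GPT-2 choices, the prompt template, and the toy in-memory corpus are illustrative assumptions.

```python
# Minimal sketch of retrieve-then-answer zero-shot VideoQA in the spirit of R2A.
# Assumptions: open_clip as the multi-modal retriever, GPT-2 (via transformers)
# as the frozen LLM; `frame_images` is a list of PIL images from the video.
import torch
import open_clip
from transformers import pipeline

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
generator = pipeline("text-generation", model="gpt2")

@torch.no_grad()
def retrieve_to_answer(frame_images, question: str, corpus: list[str], k: int = 5) -> str:
    # Embed video frames and mean-pool into a single video embedding.
    pixel = torch.stack([preprocess(img) for img in frame_images])
    video_emb = model.encode_image(pixel).mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    # Embed the candidate texts and retrieve the top-k most similar ones.
    text_emb = model.encode_text(tokenizer(corpus))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    topk = (video_emb @ text_emb.T).squeeze(0).topk(min(k, len(corpus))).indices.tolist()
    context = " ".join(corpus[i] for i in topk)
    # Let a frozen LLM answer, conditioned on the retrieved texts and the question.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    out = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()
```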
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module named BridgeFormer is trained to answer the "questions" constructed from the text features by resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task across five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and enforces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)