M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- URL: http://arxiv.org/abs/2401.17797v1
- Date: Wed, 31 Jan 2024 12:45:44 GMT
- Title: M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Authors: Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang,
Qingpei Guo
- Abstract summary: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
- Score: 13.418762442122723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training
towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP.
Built upon popular image-text models such as CLIP, most current adaptation-based
video-text pre-training methods are confronted with three major issues, i.e., a
noisy data corpus, time-consuming pre-training, and limited performance gain.
To this end, we conduct a comprehensive study covering four critical
steps in video-text pre-training. Specifically, we investigate 1) data
filtering and refinement, 2) video input type selection, 3) temporal modeling,
and 4) video feature enhancement. We then summarize this empirical study into
the M2-RAAP recipe, where our technical contributions lie in 1) the data
filtering and text re-writing pipeline resulting in 1M high-quality bilingual
video-text pairs, 2) the replacement of video inputs with key-frames to
accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to
enhance video features. We conduct extensive experiments by adapting three
image-text foundation models on two refined video-text datasets from different
languages, validating the robustness and reproducibility of M2-RAAP for
adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior
performance with significantly reduced data (-90%) and time consumption (-95%),
establishing a new SOTA on four English zero-shot retrieval datasets and two
Chinese ones. We are preparing our refined bilingual data annotations and
codebase, which will be available at
https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.
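
To make the key-frame idea in contribution 2) concrete, below is a minimal sketch (not the authors' code) of adapting a frozen image-text backbone to zero-shot video-text retrieval by encoding a handful of sampled key-frames and mean-pooling them into a single video embedding. The open_clip backbone, uniform frame sampling, and mean-pooling are illustrative assumptions; M2-RAAP's data filtering, text re-writing, and ACG feature enhancement are not reproduced here.

```python
# Minimal sketch (not the authors' code): zero-shot video-text retrieval with a
# frozen image-text model applied to a few key-frames instead of the full video.
# Assumptions: open_clip as a stand-in backbone; uniform sampling as a cheap
# proxy for key-frame selection.
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def sample_key_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames as a simple surrogate for key-frame extraction."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = torch.linspace(0, max(total - 1, 0), num_frames).long().tolist()
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

@torch.no_grad()
def video_text_similarity(video_path: str, captions: list[str]) -> torch.Tensor:
    frames = sample_key_frames(video_path)
    pixel = torch.stack([preprocess(f) for f in frames])   # (T, 3, H, W)
    frame_emb = model.encode_image(pixel)                   # (T, D) per-frame embeddings
    video_emb = frame_emb.mean(dim=0, keepdim=True)         # mean-pool frames into one video embedding
    text_emb = model.encode_text(tokenizer(captions))       # (N, D) caption embeddings
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return video_emb @ text_emb.T                           # (1, N) cosine similarities

# Usage: rank candidate captions for one video.
# sims = video_text_similarity("clip.mp4", ["a dog catches a frisbee", "a man cooks pasta"])
# print(sims.argmax(dim=-1))
```

Encoding only a few key-frames lets the image-text encoder be reused with little modification, which is the intuition behind the pre-training acceleration claimed in the abstract.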
Related papers
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost long-video understanding performance without training on long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
We propose using learnable tokens as a communication medium among these four frozen models (GPT-2, XCLIP, CLIP, and AnglE).
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield the desired answer.
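
A hedged sketch of this retrieve-then-answer idea (not the paper's implementation): frame embeddings from a frozen image-text model retrieve the nearest captions from a generic text corpus, and the retrieved texts plus the question are packed into a prompt for a frozen language model. The open_clip and GPT-2 choices, the prompt template, and the toy in-memory corpus are illustrative assumptions.

```python
# Minimal sketch of retrieve-then-answer zero-shot VideoQA in the spirit of R2A.
# Assumptions: open_clip as the multi-modal retriever, GPT-2 (via transformers)
# as the frozen LLM; `frame_images` is a list of PIL images from the video.
import torch
import open_clip
from transformers import pipeline

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
generator = pipeline("text-generation", model="gpt2")

@torch.no_grad()
def retrieve_to_answer(frame_images, question: str, corpus: list[str], k: int = 5) -> str:
    # Embed video frames and mean-pool into a single video embedding.
    pixel = torch.stack([preprocess(img) for img in frame_images])
    video_emb = model.encode_image(pixel).mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    # Embed the candidate texts and retrieve the top-k most similar ones.
    text_emb = model.encode_text(tokenizer(corpus))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    topk = (video_emb @ text_emb.T).squeeze(0).topk(min(k, len(corpus))).indices.tolist()
    context = " ".join(corpus[i] for i in topk)
    # Let a frozen LLM answer, conditioned on the retrieved texts and the question.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    out = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()
```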
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module named BridgeFormer is trained to answer the "questions" constructed from the text features by resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task across five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and enforces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)