Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval
- URL: http://arxiv.org/abs/2206.02082v4
- Date: Tue, 11 Apr 2023 02:29:20 GMT
- Title: Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval
- Authors: Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou,
Heng Ji, Shih-Fu Chang
- Abstract summary: Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
- Score: 70.30052749168013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-channel video-language retrieval requires models to
understand information from different channels (e.g., video+question,
video+speech) to correctly link a video with a textual response or query.
Fortunately, contrastive multimodal models have been shown to be highly
effective at aligning entities in images/videos and text, e.g., CLIP;
contrastive text models have recently been studied extensively for their
strong ability to produce discriminative sentence embeddings, e.g., SimCSE.
However, there is no clear way to quickly adapt these two lines of work to
multi-channel video-language retrieval
with limited data and resources. In this paper, we identify a principled model
design space with two axes: how to represent videos and how to fuse video and
text information. Based on a categorization of recent methods, we investigate the
options of representing videos using continuous feature vectors or discrete
text tokens; for the fusion method, we explore the use of a multimodal
transformer or a pretrained contrastive text model. We extensively evaluate the
four combinations on five video-language datasets. Surprisingly, we find
that discrete text tokens coupled with a pretrained contrastive text model
yield the best performance, even outperforming the state of the art on the
iVQA and How2QA datasets without additional training on millions of
video-text pairs.
Further analysis shows that this is because representing videos as text tokens
captures the key visual information and text tokens are naturally aligned with
text models that are strong retrievers after the contrastive pretraining
process. Together, these empirical analyses establish a solid foundation for future
research on affordable and upgradable multimodal intelligence.
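
To make the winning design concrete, below is a minimal sketch of the
"discrete text tokens + pretrained contrastive text model" combination
described in the abstract: each frame is mapped to its nearest vocabulary
words with CLIP, the resulting tokens are fused with the text channel by
simple concatenation, and candidates are ranked by cosine similarity under a
SimCSE-style text encoder. The specific checkpoints, the toy vocabulary, the
top-k choice, and the pooling strategy are illustrative assumptions, not the
authors' released configuration.

```python
# Hypothetical sketch (not the authors' code) of "videos as discrete text
# tokens" + "pretrained contrastive text model" retrieval.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-large")
text_enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-large").eval()

# Toy vocabulary of visual concepts; a real setup would use a much larger one.
VOCAB = ["dog", "guitar", "kitchen", "car", "beach", "person cooking",
         "soccer ball", "computer", "mountain", "crowd"]

@torch.no_grad()
def video_to_text_tokens(frames: list[Image.Image], top_k: int = 3) -> str:
    """Represent a video as the vocabulary words closest to its frames in CLIP space."""
    img_emb = clip.get_image_features(**clip_proc(images=frames, return_tensors="pt"))
    txt_emb = clip.get_text_features(**clip_proc(text=VOCAB, return_tensors="pt", padding=True))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    top = (img_emb @ txt_emb.T).topk(top_k, dim=-1).indices   # (num_frames, top_k)
    words = [VOCAB[i] for row in top.tolist() for i in row]
    return " ".join(dict.fromkeys(words))                     # dedupe, keep order

@torch.no_grad()
def embed_sentences(sentences: list[str]) -> torch.Tensor:
    """SimCSE-style sentence embeddings (pooler output is one common pooling choice)."""
    inputs = text_tok(sentences, padding=True, truncation=True, return_tensors="pt")
    emb = text_enc(**inputs).pooler_output
    return emb / emb.norm(dim=-1, keepdim=True)

@torch.no_grad()
def rank_candidates(frames, query_text: str, candidates: list[str]):
    """Fuse video tokens with the text channel by concatenation; rank by cosine similarity."""
    query = video_to_text_tokens(frames) + " " + query_text
    scores = (embed_sentences([query]) @ embed_sentences(candidates).T).squeeze(0)
    return sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1])
```

Because the video is already expressed as text, no additional multimodal
fusion layers need to be trained; the contrastive text encoder's existing
retrieval ability does the matching, consistent with the abstract's analysis
of why this combination transfers well with limited data.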
Related papers
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for
Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which uses only text queries together with uncurated web videos during training.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution-matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)