BridgeFormer: Bridging Video-text Retrieval with Multiple Choice
Questions
- URL: http://arxiv.org/abs/2201.04850v1
- Date: Thu, 13 Jan 2022 09:33:54 GMT
- Title: BridgeFormer: Bridging Video-text Retrieval with Multiple Choice
Questions
- Authors: Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie and
Ping Luo
- Abstract summary: We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets.
- Score: 38.843518809230524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training a model to learn transferable video-text representation for
retrieval has attracted a lot of attention in recent years. Previous dominant
works mainly adopt two separate encoders for efficient retrieval, but ignore
local associations between videos and texts. Another line of research uses a
joint encoder to interact video with texts, but results in low efficiency since
each text-video pair needs to be fed into the model. In this work, we enable
fine-grained video-text interactions while maintaining high efficiency for
retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ),
where a parametric module BridgeFormer is trained to answer the "questions"
constructed from the text features by resorting to the video features.
Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to
build questions, with which the video encoder can be trained to capture more
regional content and temporal dynamics. In the form of questions and answers,
the semantic associations between local video-text features can be properly
established. BridgeFormer can be removed for downstream retrieval, yielding an
efficient and flexible model with only two encoders. Our method
outperforms state-of-the-art methods on the popular text-to-video retrieval
task on five datasets under different experimental setups (i.e., zero-shot and
fine-tuning), including HowTo100M (one million videos). We further conduct
zero-shot action recognition, which can be cast as video-to-text retrieval, and
our approach also significantly surpasses its counterparts. As an additional
benefit, our method achieves competitive results with much shorter pre-training
videos on single-modality downstream tasks, e.g., action recognition with
linear evaluation.
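
The abstract describes the mechanism only at a high level: nouns and verbs are erased from the caption to form a "question", a bridge module answers it by attending from the question tokens to the video tokens, and the bridge is discarded at retrieval time so only the two encoders remain. The following sketch illustrates that idea under stated assumptions; the class name BridgeFormerSketch, the single cross-attention block, the feature dimensions, the mean pooling, and the InfoNCE temperature are illustrative stand-ins, and random tensors take the place of real video/text encoder outputs. It is not the authors' implementation.

```python
# Minimal sketch of the MCQ pretext task, under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BridgeFormerSketch(nn.Module):
    """Cross-attention block: noun/verb 'question' tokens attend to video tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, question_tokens, video_tokens):
        # The question (caption with a noun/verb erased) queries the video features ...
        x, _ = self.attn(question_tokens, video_tokens, video_tokens)
        x = self.norm1(question_tokens + x)
        x = self.norm2(x + self.ffn(x))
        # ... and the pooled output is the predicted "answer".
        return x.mean(dim=1)


def info_nce(answers, targets, temperature: float = 0.05):
    """Contrastive loss: each predicted answer should match the encoding of the
    phrase erased from its own caption (the diagonal of the similarity matrix)."""
    answers = F.normalize(answers, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = answers @ targets.t() / temperature
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    batch, n_video_tokens, n_question_tokens, dim = 8, 32, 12, 256
    bridge = BridgeFormerSketch(dim)
    # Stand-ins for the outputs of the separate video and text encoders.
    video_tokens = torch.randn(batch, n_video_tokens, dim)
    question_tokens = torch.randn(batch, n_question_tokens, dim)  # caption with noun/verb erased
    erased_phrase_emb = torch.randn(batch, dim)                   # encoding of the erased phrase
    loss = info_nce(bridge(question_tokens, video_tokens), erased_phrase_emb)
    loss.backward()
    print(f"MCQ pretext loss: {loss.item():.4f}")
```

Because the bridge module only participates in the pretext loss, dropping it at inference leaves a plain dual-encoder retrieval model, which is where the efficiency claim in the abstract comes from.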
Related papers
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Although ReLoCLNet encodes text and video separately for efficiency, experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)