CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text
Retrieval
- URL: http://arxiv.org/abs/2111.05610v1
- Date: Wed, 10 Nov 2021 10:05:11 GMT
- Title: CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text
Retrieval
- Authors: Zijian Gao, Jingyu Liu, Sheng Chen, Dedan Chang, Hao Zhang, Jinwei
Yuan
- Abstract summary: We present CLIP2TV, which aims to explore where the critical elements lie in transformer-based methods.
Notably, CLIP2TV achieves 52.9@R1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
- Score: 14.022356429411934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern video-text retrieval frameworks basically consist of three parts: a
video encoder, a text encoder, and a similarity head. With the success of
transformers in both visual and textual representation learning, transformer-based
encoders and fusion methods have also been adopted in the field of video-text
retrieval. In this report, we present CLIP2TV, aiming to explore where the
critical elements lie in transformer-based methods. To achieve this, we first
revisit some recent works on multi-modal learning, then introduce several
techniques into video-text retrieval, and finally evaluate them through
extensive experiments in different configurations. Notably, CLIP2TV achieves
52.9@R1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
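The three-part design described in the abstract (video encoder, text encoder, similarity head) can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch sketch, not the CLIP2TV implementation; the class name VideoTextRetriever, the per-frame encoder, and the mean-pooling similarity head are simplifying assumptions for exposition.

```python
# Illustrative sketch of a generic three-part video-text retrieval framework:
# video encoder, text encoder, and a cosine-similarity head. NOT the CLIP2TV
# implementation; names and the mean-pooling choice are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextRetriever(nn.Module):
    def __init__(self, frame_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.frame_encoder = frame_encoder   # e.g. a CLIP-style image encoder applied per frame
        self.text_encoder = text_encoder     # e.g. a CLIP-style text transformer
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log temperature

    def encode_video(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1))   # (batch * num_frames, dim)
        feats = feats.view(b, t, -1).mean(dim=1)            # temporal mean pooling
        return F.normalize(feats, dim=-1)

    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.text_encoder(tokens), dim=-1)

    def forward(self, frames: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        v = self.encode_video(frames)
        t = self.encode_text(tokens)
        # similarity head: scaled cosine-similarity matrix used for contrastive training
        return self.logit_scale.exp() * v @ t.t()
```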
Related papers
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection.
In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z) - A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce TextVR, a large-scale cross-modal video retrieval dataset with text reading comprehension.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z) - Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, the proposed network architecture is trained following a multiple-space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - End-to-End Video Text Spotting with Transformer [86.46724646835627]
We propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR).
TransDETR is the first end-to-end trainable video text spotting framework that simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition).
arXiv Detail & Related papers (2022-03-20T12:14:58Z) - All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, which embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z) - MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to select those that provide the best prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z) - A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text
Spotter with Transformer [12.167938646139705]
We introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText).
Firstly, we provide 2,000+ videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos.
Secondly, our dataset covers 30+ open categories with a wide selection of various scenarios, e.g., Life Vlog, Driving, Movie, etc.
arXiv Detail & Related papers (2021-12-09T13:21:26Z) - TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z) - HiT: Hierarchical Transformer with Momentum Contrast for Video-Text
Retrieval [40.646628490887075]
We propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval.
HiT performs hierarchical cross-modal contrastive matching at both the feature level and the semantic level to achieve multi-view and comprehensive retrieval results.
Inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning, enabling large-scale negative interactions on-the-fly (a generic cross-modal contrastive loss of this kind is sketched after this list).
arXiv Detail & Related papers (2021-03-28T04:52:25Z) - Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
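Several of the papers above (e.g., TACo, HiT, and CLIP2TV itself) build on cross-modal contrastive learning between video and text embeddings. The sketch below shows a generic symmetric InfoNCE-style video-text contrastive loss; it illustrates the common pattern only and is not the implementation of any specific paper listed here, and the function name and temperature value are assumptions.

```python
# Generic symmetric cross-modal contrastive (InfoNCE-style) loss, as used in
# spirit by several of the papers listed above. A minimal sketch with assumed
# names; not the implementation of any specific paper.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.05) -> torch.Tensor:
    # video_emb, text_emb: (batch, dim) L2-normalized embeddings of paired clips/captions
    logits = video_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)                # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)            # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```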
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content and is not responsible for any consequences arising from its use.