MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization
- URL: http://arxiv.org/abs/2203.07086v1
- Date: Mon, 14 Mar 2022 13:15:09 GMT
- Title: MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization
- Authors: Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei
Ivaniuta
- Abstract summary: Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to select those with the best prior knowledge.
- Score: 65.09758931804478
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work we present a new state of the art on the text-to-video retrieval
task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model.
Three different data sources are combined: weakly-supervised videos,
crowd-labeled text-image pairs and text-video pairs. A careful analysis of
available pre-trained networks helps to select those with the best prior knowledge.
We introduce a three-stage training procedure that provides high knowledge-transfer
efficiency and allows noisy datasets to be used during training without degrading
prior knowledge. Additionally, double positional encoding is used for better fusion
of different modalities, and a simple method for processing non-square inputs is suggested.
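The paper does not include code here; purely as an illustration (the module name, dimensions and exact placement below are assumptions, not the authors' implementation), a "double" positional encoding for fusing tokens produced by several pre-trained experts might add both a within-stream position embedding and a modality embedding to each token before the aggregating transformer:

    import torch
    import torch.nn as nn

    class DoublePositionalEncoding(nn.Module):
        """Hypothetical sketch: every fused token receives two learned embeddings,
        one for its position inside its modality stream and one for the modality
        (expert) that produced it, before entering the fusion transformer."""

        def __init__(self, dim: int, max_len: int = 256, num_modalities: int = 4):
            super().__init__()
            self.pos_emb = nn.Embedding(max_len, dim)         # position within the stream
            self.mod_emb = nn.Embedding(num_modalities, dim)  # which expert the token came from

        def forward(self, tokens: torch.Tensor, modality_id: int) -> torch.Tensor:
            # tokens: (batch, seq_len, dim) features from one pre-trained expert
            positions = torch.arange(tokens.size(1), device=tokens.device)
            return tokens + self.pos_emb(positions) + self.mod_emb(
                torch.tensor(modality_id, device=tokens.device)
            )

    # Usage sketch: encode video and audio expert features, then concatenate for fusion.
    enc = DoublePositionalEncoding(dim=512)
    video_tokens = torch.randn(2, 32, 512)  # placeholder expert features
    audio_tokens = torch.randn(2, 48, 512)
    fused_input = torch.cat(
        [enc(video_tokens, modality_id=0), enc(audio_tokens, modality_id=1)], dim=1
    )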
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques; a minimal generic text-video contrastive objective is sketched after this related-papers list.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
arXiv Detail & Related papers (2023-04-12T15:50:19Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with a Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
- Delving Deeper into the Decoder for Video Captioning [23.202746094988715]
Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence.
We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model.
The effectiveness of these techniques is demonstrated in experiments on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets.
arXiv Detail & Related papers (2020-01-16T02:18:27Z)
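As referenced above, several of these retrieval papers (and MDMMT-2 itself) rank videos against texts by embedding similarity trained with contrastive objectives. The specific losses differ per paper, but a minimal, generic symmetric InfoNCE-style text-video objective, sketched here in PyTorch with hypothetical names, illustrates the common idea:

    import torch
    import torch.nn.functional as F

    def text_video_contrastive_loss(text_emb: torch.Tensor,
                                    video_emb: torch.Tensor,
                                    temperature: float = 0.05) -> torch.Tensor:
        """Symmetric InfoNCE over a batch of matched (text, video) pairs.

        text_emb, video_emb: (batch, dim) embeddings; row i of each is a positive
        pair, and every other row in the batch serves as a negative.
        """
        text_emb = F.normalize(text_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = text_emb @ video_emb.t() / temperature   # (batch, batch) similarities
        targets = torch.arange(text_emb.size(0), device=text_emb.device)
        # Rank the matching video above all others for each text, and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))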