MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization
- URL: http://arxiv.org/abs/2203.07086v1
- Date: Mon, 14 Mar 2022 13:15:09 GMT
- Title: MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization
- Authors: Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei
Ivaniuta
- Abstract summary: Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose those with the best prior knowledge.
- Score: 65.09758931804478
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work we present a new State-of-The-Art on the text-to-video retrieval
task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF obtained by a single model.
Three different data sources are combined: weakly-supervised videos,
crowd-labeled text-image pairs and text-video pairs. A careful analysis of
available pre-trained networks helps to choose those with the best prior
knowledge. We introduce a three-stage training procedure that provides high
knowledge-transfer efficiency and allows the use of noisy datasets during
training without degrading prior knowledge. Additionally, double positional
encoding is used for better fusion of different modalities, and a simple
method for processing non-square inputs is suggested.
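The abstract does not detail the double positional encoding, so the following is only a minimal PyTorch sketch of one plausible reading: every fused token receives a learned embedding for its position inside its own modality segment plus a learned embedding identifying the modality. All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DoublePositionalEncoding(nn.Module):
    """Illustrative sketch (not the authors' code): add two learned
    embeddings to every fused token -- its position inside its own
    modality segment and the identity of that modality."""

    def __init__(self, dim, max_len=512, num_modalities=3):
        super().__init__()
        self.intra_pos = nn.Embedding(max_len, dim)            # position within a modality
        self.modality_emb = nn.Embedding(num_modalities, dim)  # which modality

    def forward(self, x, modality_ids):
        # x: (batch, seq, dim); modality_ids: (seq,) long tensor.
        # The j-th token of each modality gets within-modality position j.
        intra = torch.zeros_like(modality_ids)
        for m in modality_ids.unique():
            mask = modality_ids == m
            intra[mask] = torch.arange(int(mask.sum()), device=x.device)
        return x + self.intra_pos(intra) + self.modality_emb(modality_ids)
```

A fused sequence of, e.g., video, audio, and text tokens would be passed together with a per-token modality index before entering the transformer.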
Related papers
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid that synthesizes video-like samples to enrich the instruction diversity of the training corpus.
We find that the proposed scheme can boost long-video understanding without training on long video samples.
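As a rough, assumed illustration of the text-to-multi-image idea (not T2Vid's actual pipeline), one can chunk a long text and render each chunk as an image "frame":

```python
import textwrap
from PIL import Image, ImageDraw

def text_to_pseudo_video(long_text, chunk_chars=200, size=(448, 448)):
    """Illustrative sketch: split a long instruction text into chunks and
    render each chunk as an image, producing video-like 'frames' that a
    video-LLM could consume as a multi-image training sample."""
    chunks = [long_text[i:i + chunk_chars]
              for i in range(0, len(long_text), chunk_chars)]
    frames = []
    for chunk in chunks:
        frame = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(frame)
        draw.multiline_text((10, 10), "\n".join(textwrap.wrap(chunk, 50)),
                            fill="black")
        frames.append(frame)
    return frames
```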
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
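A minimal sketch of the retrieval pairing step, assuming a precomputed embedding bank and cosine similarity (both assumptions; the paper's retrieval setup may differ):

```python
import numpy as np

def retrieve_incontext_examples(prompt_emb, bank_embs, bank_items, k=3):
    """Illustrative sketch: pair a text prompt with its k nearest
    multimodal examples by cosine similarity over an embedding bank."""
    bank_norm = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    query_norm = prompt_emb / np.linalg.norm(prompt_emb)
    scores = bank_norm @ query_norm          # (N,) cosine similarities
    top = np.argsort(-scores)[:k]            # indices of best matches
    return [bank_items[i] for i in top]
```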
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
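The exact triplet partial margin formulation is defined in the paper; as a generic sketch in the same spirit, one can subtract a margin from the positive similarities in a symmetric InfoNCE loss (hyperparameters below are arbitrary):

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(video_emb, text_emb, margin=0.2, temp=0.05):
    """Illustrative sketch: symmetric InfoNCE where the matched (diagonal)
    similarities are reduced by a margin, so positives must beat
    negatives by at least `margin` to drive the loss to zero."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                             # (B, B) cosine matrix
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)
    labels = torch.arange(sim.size(0), device=sim.device)    # positives on diagonal
    return (F.cross_entropy(sim / temp, labels)
            + F.cross_entropy(sim.T / temp, labels)) / 2
```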
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and can employ an optional input modality beyond video when available.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
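The many-to-many label propagation can be pictured as spreading per-frame predictions along feature affinities; the sketch below is a generic version of that idea, not the paper's algorithm:

```python
import torch
import torch.nn.functional as F

def propagate_labels(feats, labels, temp=0.07):
    """Illustrative sketch: smooth per-frame label distributions with a
    softmax-normalized affinity over frame features, so temporally
    similar frames receive consistent predictions.
    feats: (T, D) one feature vector per frame
    labels: (T, C) per-frame class probabilities"""
    f = F.normalize(feats, dim=-1)
    affinity = F.softmax(f @ f.T / temp, dim=-1)   # (T, T) frame-to-frame weights
    return affinity @ labels                        # (T, C) propagated labels
```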
arXiv Detail & Related papers (2023-04-12T15:50:19Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
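One cheap adaptation strategy, sketched below under the assumption that a pretrained contrastive backbone already embeds each channel, is to fuse channels by simple averaging before the usual cosine ranking:

```python
import torch
import torch.nn.functional as F

def multichannel_score(query_emb, channel_embs):
    """Illustrative sketch: score a text query against a multi-channel
    video (e.g. frames + transcript) by averaging the per-channel
    embeddings from a pretrained contrastive model, then taking
    cosine similarity. channel_embs: list of (D,) tensors."""
    fused = F.normalize(torch.stack(channel_embs).mean(dim=0), dim=-1)
    return torch.dot(F.normalize(query_emb, dim=-1), fused)
```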
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
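As a flavor of the "token-aware" part (an assumed sketch, not TACo's definition), text tokens can be pooled with higher weight on content words before the video-text contrast:

```python
import torch

def token_aware_pool(token_embs, content_mask, content_weight=2.0):
    """Illustrative sketch: weighted mean pooling that up-weights
    content words (e.g. nouns and verbs) when building the sentence
    embedding used for video-text contrast.
    token_embs: (L, D); content_mask: (L,) bool."""
    w = torch.ones(content_mask.shape[0], dtype=token_embs.dtype,
                   device=token_embs.device)
    w[content_mask] = content_weight       # content words count more
    w = w / w.sum()
    return (token_embs * w.unsqueeze(-1)).sum(dim=0)
```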
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
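The StagedP idea can be outlined as unfreezing different subsets of modules across stages; the loop below is an assumed outline with a hypothetical model interface, not UniVL's training code:

```python
def staged_pretrain(model, stages, loader, optimizer):
    """Illustrative sketch of stage-by-stage pre-training: each stage
    unfreezes a different subset of modules (e.g. single-modal encoders
    first, then the cross encoder and decoder) before the usual loop.
    stages: list of (trainable_modules, num_epochs) pairs."""
    for trainable_modules, num_epochs in stages:
        for p in model.parameters():        # freeze everything
            p.requires_grad = False
        for module in trainable_modules:    # unfreeze this stage's modules
            for p in module.parameters():
                p.requires_grad = True
        for _ in range(num_epochs):
            for batch in loader:
                loss = model(**batch)       # assumes the model returns its loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```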
arXiv Detail & Related papers (2020-02-15T10:03:25Z)