MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization
- URL: http://arxiv.org/abs/2203.07086v1
- Date: Mon, 14 Mar 2022 13:15:09 GMT
- Title: MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization
- Authors: Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei
Ivaniuta
- Abstract summary: Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose those with the best prior knowledge.
- Score: 65.09758931804478
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work we present a new State-of-The-Art on the text-to-video retrieval
task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF obtained by a single model.
Three different data sources are combined: weakly-supervised videos,
crowd-labeled text-image pairs and text-video pairs. A careful analysis of
available pre-trained networks helps to choose those with the best prior
knowledge. We introduce a three-stage training procedure that provides high
knowledge-transfer efficiency and allows the use of noisy datasets during
training without degrading prior knowledge. Additionally, double positional
encoding is used for better fusion of different modalities, and a simple
method for processing non-square inputs is suggested.
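The abstract does not detail the double positional encoding, so the following is only a minimal PyTorch sketch of one plausible reading: every fused token receives a learned embedding for its position inside its own modality segment plus a learned embedding identifying the modality. All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DoublePositionalEncoding(nn.Module):
    """Illustrative sketch (not the authors' code): add two learned
    embeddings to every fused token -- its position inside its own
    modality segment and the identity of that modality."""

    def __init__(self, dim, max_len=512, num_modalities=3):
        super().__init__()
        self.intra_pos = nn.Embedding(max_len, dim)            # position within a modality
        self.modality_emb = nn.Embedding(num_modalities, dim)  # which modality

    def forward(self, x, modality_ids):
        # x: (batch, seq, dim); modality_ids: (seq,) long tensor.
        # The j-th token of each modality gets within-modality position j.
        intra = torch.zeros_like(modality_ids)
        for m in modality_ids.unique():
            mask = modality_ids == m
            intra[mask] = torch.arange(int(mask.sum()), device=x.device)
        return x + self.intra_pos(intra) + self.modality_emb(modality_ids)
```

A fused sequence of, e.g., video, audio, and text tokens would be passed together with a per-token modality index before entering the transformer.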
Related papers
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid that synthesizes video-like samples to enrich the instruction diversity of the training corpus.
We find that the proposed scheme can boost long-video understanding without training on long video samples.
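As a rough, assumed illustration of the text-to-multi-image idea (not T2Vid's actual pipeline), one can chunk a long text and render each chunk as an image "frame":

```python
import textwrap
from PIL import Image, ImageDraw

def text_to_pseudo_video(long_text, chunk_chars=200, size=(448, 448)):
    """Illustrative sketch: split a long instruction text into chunks and
    render each chunk as an image, producing video-like 'frames' that a
    video-LLM could consume as a multi-image training sample."""
    chunks = [long_text[i:i + chunk_chars]
              for i in range(0, len(long_text), chunk_chars)]
    frames = []
    for chunk in chunks:
        frame = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(frame)
        draw.multiline_text((10, 10), "\n".join(textwrap.wrap(chunk, 50)),
                            fill="black")
        frames.append(frame)
    return frames
```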
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
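A minimal sketch of the retrieval pairing step, assuming a precomputed embedding bank and cosine similarity (both assumptions; the paper's retrieval setup may differ):

```python
import numpy as np

def retrieve_incontext_examples(prompt_emb, bank_embs, bank_items, k=3):
    """Illustrative sketch: pair a text prompt with its k nearest
    multimodal examples by cosine similarity over an embedding bank."""
    bank_norm = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    query_norm = prompt_emb / np.linalg.norm(prompt_emb)
    scores = bank_norm @ query_norm          # (N,) cosine similarities
    top = np.argsort(-scores)[:k]            # indices of best matches
    return [bank_items[i] for i in top]
```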
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
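The exact triplet partial margin formulation is defined in the paper; as a generic sketch in the same spirit, one can subtract a margin from the positive similarities in a symmetric InfoNCE loss (hyperparameters below are arbitrary):

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(video_emb, text_emb, margin=0.2, temp=0.05):
    """Illustrative sketch: symmetric InfoNCE where the matched (diagonal)
    similarities are reduced by a margin, so positives must beat
    negatives by at least `margin` to drive the loss to zero."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                             # (B, B) cosine matrix
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)
    labels = torch.arange(sim.size(0), device=sim.device)    # positives on diagonal
    return (F.cross_entropy(sim / temp, labels)
            + F.cross_entropy(sim.T / temp, labels)) / 2
```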
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and can employ an optional input modality beyond video when available.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
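The many-to-many label propagation can be pictured as spreading per-frame predictions along feature affinities; the sketch below is a generic version of that idea, not the paper's algorithm:

```python
import torch
import torch.nn.functional as F

def propagate_labels(feats, labels, temp=0.07):
    """Illustrative sketch: smooth per-frame label distributions with a
    softmax-normalized affinity over frame features, so temporally
    similar frames receive consistent predictions.
    feats: (T, D) one feature vector per frame
    labels: (T, C) per-frame class probabilities"""
    f = F.normalize(feats, dim=-1)
    affinity = F.softmax(f @ f.T / temp, dim=-1)   # (T, T) frame-to-frame weights
    return affinity @ labels                        # (T, C) propagated labels
```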
arXiv Detail & Related papers (2023-04-12T15:50:19Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
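One cheap adaptation strategy, sketched below under the assumption that a pretrained contrastive backbone already embeds each channel, is to fuse channels by simple averaging before the usual cosine ranking:

```python
import torch
import torch.nn.functional as F

def multichannel_score(query_emb, channel_embs):
    """Illustrative sketch: score a text query against a multi-channel
    video (e.g. frames + transcript) by averaging the per-channel
    embeddings from a pretrained contrastive model, then taking
    cosine similarity. channel_embs: list of (D,) tensors."""
    fused = F.normalize(torch.stack(channel_embs).mean(dim=0), dim=-1)
    return torch.dot(F.normalize(query_emb, dim=-1), fused)
```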
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
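As a flavor of the "token-aware" part (an assumed sketch, not TACo's definition), text tokens can be pooled with higher weight on content words before the video-text contrast:

```python
import torch

def token_aware_pool(token_embs, content_mask, content_weight=2.0):
    """Illustrative sketch: weighted mean pooling that up-weights
    content words (e.g. nouns and verbs) when building the sentence
    embedding used for video-text contrast.
    token_embs: (L, D); content_mask: (L,) bool."""
    w = torch.ones(content_mask.shape[0], dtype=token_embs.dtype,
                   device=token_embs.device)
    w[content_mask] = content_weight       # content words count more
    w = w / w.sum()
    return (token_embs * w.unsqueeze(-1)).sum(dim=0)
```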
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
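The StagedP idea can be outlined as unfreezing different subsets of modules across stages; the loop below is an assumed outline with a hypothetical model interface, not UniVL's training code:

```python
def staged_pretrain(model, stages, loader, optimizer):
    """Illustrative sketch of stage-by-stage pre-training: each stage
    unfreezes a different subset of modules (e.g. single-modal encoders
    first, then the cross encoder and decoder) before the usual loop.
    stages: list of (trainable_modules, num_epochs) pairs."""
    for trainable_modules, num_epochs in stages:
        for p in model.parameters():        # freeze everything
            p.requires_grad = False
        for module in trainable_modules:    # unfreeze this stage's modules
            for p in module.parameters():
                p.requires_grad = True
        for _ in range(num_epochs):
            for batch in loader:
                loss = model(**batch)       # assumes the model returns its loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```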
arXiv Detail & Related papers (2020-02-15T10:03:25Z)