MDMMT: Multidomain Multimodal Transformer for Video Retrieval
- URL: http://arxiv.org/abs/2103.10699v1
- Date: Fri, 19 Mar 2021 09:16:39 GMT
- Title: MDMMT: Multidomain Multimodal Transformer for Video Retrieval
- Authors: Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko
- Abstract summary: We present a new state-of-the-art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks.
We show that training on several datasets jointly can improve each dataset's test results.
- Score: 63.872634680339644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new state-of-the-art on the text-to-video retrieval task
on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous
solutions by a large margin. Moreover, these state-of-the-art results are
achieved with a single model on both datasets without finetuning. This
multidomain generalisation is achieved by a proper combination of different
video caption datasets. We show that training on several datasets jointly can
improve each dataset's test results. Additionally, we examined the
intersections between many popular datasets and found that MSRVTT has
significant overlap between its test and train parts; the same holds for
ActivityNet.
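The core retrieval setup the abstract describes can be illustrated with a minimal sketch: texts and videos are mapped into a joint embedding space, and retrieval ranks videos by cosine similarity to a text query. The encoders below are random-projection stand-ins, not the MDMMT model; only the ranking logic is the point.

```python
# Minimal sketch of text-to-video retrieval in a joint embedding space.
# The "embeddings" are random stand-ins for real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Normalize rows to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend embeddings: 5 videos and 1 text query, both in a shared 8-D space.
video_embs = l2_normalize(rng.normal(size=(5, 8)))
text_emb = l2_normalize(rng.normal(size=(1, 8)))

# Text-to-video retrieval: rank all videos by cosine similarity to the query.
sims = text_emb @ video_embs.T   # shape (1, 5)
ranking = np.argsort(-sims[0])   # indices of videos, best match first
print(ranking)
```

In a real system the two encoders are trained (e.g., contrastively) so that a caption and its video land close together; the ranking step itself stays this simple.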
Related papers
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Merging Vision Transformers from Different Tasks and Domains [46.40701388197936]
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model-merging work focuses on either CNNs or NLP models, leaving ViT merging unexplored.
arXiv Detail & Related papers (2023-12-25T09:32:28Z)
- GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data [27.63411386396492]
This paper introduces a new benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data.
The proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data.
arXiv Detail & Related papers (2023-05-24T09:03:18Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit [0.0]
We propose a toolkit for systematic multimodal VAE training and comparison.
We present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities.
arXiv Detail & Related papers (2022-09-07T10:26:28Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose those with the best prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- Multi-query Video Retrieval [44.32936301162444]
We focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.
We propose several new methods for leveraging multiple queries at training time to improve over simply combining similarity outputs of multiple queries.
We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications.
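The "simply combining similarity outputs" baseline mentioned in this summary can be sketched in two common variants: averaging per-query similarity scores (late fusion) versus averaging query embeddings before scoring (early fusion). All embeddings below are random stand-ins; the combination logic is the only part that reflects the setting, and it is a generic baseline, not the paper's proposed methods.

```python
# Sketch of two simple multi-query combination baselines for video retrieval.
# Random embeddings stand in for real encoder outputs.
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

video_embs = l2_normalize(rng.normal(size=(10, 8)))  # 10 candidate videos
query_embs = l2_normalize(rng.normal(size=(3, 8)))   # 3 queries for one target

# Late fusion: score every video against each query, then average the scores.
late_scores = (query_embs @ video_embs.T).mean(axis=0)

# Early fusion: pool the query embeddings first, then score once.
pooled_query = l2_normalize(query_embs.mean(axis=0, keepdims=True))
early_scores = (pooled_query @ video_embs.T)[0]

print(np.argmax(late_scores), np.argmax(early_scores))
```

The two variants can disagree on the top-ranked video, which is one reason learned multi-query combination at training time can beat either fixed rule.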
arXiv Detail & Related papers (2022-01-10T20:44:46Z)
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)
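The last summary above, a common multimodal embedding space with a clustering step, can be illustrated with a toy sketch. The embeddings are random stand-ins, the cluster count is an arbitrary choice, and nearest-centroid assignment here is a simplification of the paper's learned clustering.

```python
# Toy sketch: cross-modal retrieval in a shared embedding space, plus a
# nearest-centroid clustering step over that space. All values are random
# stand-ins for trained encoder outputs.
import numpy as np

rng = np.random.default_rng(2)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Video and text embeddings for the same 6 clips in one shared 8-D space.
video = l2_normalize(rng.normal(size=(6, 8)))
text = l2_normalize(rng.normal(size=(6, 8)))

# Cross-modal retrieval: for each text, the nearest video by cosine similarity.
nearest_video = np.argmax(text @ video.T, axis=1)

# Clustering step: assign each video embedding to the nearest of k centroids,
# grouping semantically similar samples under a shared cluster id.
k = 2
centroids = l2_normalize(rng.normal(size=(k, 8)))
cluster_ids = np.argmax(video @ centroids.T, axis=1)
print(nearest_video, cluster_ids)
```

Because all modalities live in the same space, the same nearest-neighbour and clustering operations work for any modality pair, which is what enables retrieval across unseen datasets and domains.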
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.