MDMMT: Multidomain Multimodal Transformer for Video Retrieval
- URL: http://arxiv.org/abs/2103.10699v1
- Date: Fri, 19 Mar 2021 09:16:39 GMT
- Title: MDMMT: Multidomain Multimodal Transformer for Video Retrieval
- Authors: Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr
Petiushko
- Abstract summary: We present a new state-of-the-art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks.
We show that training on different datasets can mutually improve test results.
- Score: 63.872634680339644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new state-of-the-art on the text-to-video retrieval task on
the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions
by a large margin. Moreover, these state-of-the-art results are achieved with a
single model on both datasets without finetuning. This multidomain
generalisation is achieved by a proper combination of different video caption
datasets. We show that training on different datasets can mutually improve test
results. Additionally, we checked the intersection between many popular datasets
and found that MSRVTT has significant overlap between its test and train
parts; the same situation is observed for ActivityNet.
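At inference time, text-to-video retrieval of this kind reduces to ranking candidate videos by their similarity to a query embedding in a shared space. The sketch below is a minimal illustration of that ranking step with cosine similarity; the embeddings are random stand-ins for encoder outputs, and none of this reflects MDMMT's actual architecture.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_videos(text_embedding, video_embeddings):
    """Rank videos by cosine similarity to a single text query.

    text_embedding:   (d,)   embedding of the caption/query
    video_embeddings: (n, d) embeddings of the n candidate videos
    Returns indices sorted from most to least similar, plus the raw scores.
    """
    t = l2_normalize(text_embedding)
    v = l2_normalize(video_embeddings)
    scores = v @ t                      # cosine similarities, shape (n,)
    return np.argsort(-scores), scores

# Hypothetical usage: in practice the embeddings come from the text and video encoders.
text_emb = np.random.randn(512)          # stand-in for one encoded query
video_embs = np.random.randn(1000, 512)  # stand-in for 1000 encoded videos
ranking, scores = rank_videos(text_emb, video_embs)
recall_at_5 = int(0 in ranking[:5])      # assuming index 0 were the ground-truth video
```

The same kind of embedding similarity can also be used to flag near-duplicate clips between train and test splits, which is the spirit of the dataset-overlap check mentioned in the abstract.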
Related papers
- IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking [13.977088329815933]
Multi-Object Tracking (MOT) aims to associate multiple objects across video frames.
Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability.
We develop IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions.
arXiv Detail & Related papers (2024-10-30T14:24:56Z)
- Uni$^2$Det: Unified and Universal Framework for Prompt-Guided Multi-dataset 3D Detection [64.08296187555095]
Uni$^2$Det is a framework for unified and universal multi-dataset training on 3D detection.
We introduce multi-stage prompting modules for multi-dataset 3D detection.
Results on zero-shot cross-dataset transfer validate the generalization capability of our proposed method.
arXiv Detail & Related papers (2024-09-30T17:57:50Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
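The two-step recipe summarized above (unify heterogeneous feedback into one supervision format, then keep a high-quality subset) can be pictured with a small data-munging sketch. The record fields and the quality heuristic below are hypothetical stand-ins, not the paper's actual pipeline.

```python
def to_preference_pair(record):
    """Convert one heterogeneous feedback record into a unified
    (prompt, chosen, rejected) format; records with only a scalar
    rating keep their best response for SFT-style supervision."""
    if record["type"] == "pairwise":          # explicit A-vs-B preference
        return {"prompt": record["prompt"],
                "chosen": record["preferred"],
                "rejected": record["other"]}
    if record["type"] == "rating":            # scalar score on one response
        return {"prompt": record["prompt"],
                "chosen": record["response"],
                "rejected": None,
                "score": record["score"]}
    return None

def select_subset(unified, min_score=4, max_items=10_000):
    """Keep a high-quality slice: all pairwise preferences plus
    highly rated single responses (hypothetical heuristic)."""
    keep = [r for r in unified
            if r and (r["rejected"] is not None or r.get("score", 0) >= min_score)]
    return keep[:max_items]

raw = [
    {"type": "pairwise", "prompt": "Summarize...", "preferred": "A", "other": "B"},
    {"type": "rating", "prompt": "Explain...", "response": "C", "score": 5},
]
dataset = select_subset([to_preference_pair(r) for r in raw])
```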
- Merging Vision Transformers from Different Tasks and Domains [46.40701388197936]
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model merging work focuses on either CNNs or NLP models, leaving ViT merging unexplored.
arXiv Detail & Related papers (2023-12-25T09:32:28Z)
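As a point of reference for the merging problem described above, the snippet below simply averages the parameters of several same-architecture checkpoints. This is only the naive baseline such work compares against, not the method proposed in the paper; the parameter names are hypothetical.

```python
import numpy as np

def average_checkpoints(state_dicts, weights=None):
    """Uniform (or weighted) parameter averaging of same-architecture models.
    state_dicts: list of {param_name: np.ndarray} with identical keys/shapes,
    e.g. ViTs fine-tuned on different datasets or domains."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage with two tiny fake "checkpoints".
a = {"blocks.0.attn.qkv.weight": np.ones((3, 3)), "head.weight": np.ones((2, 3))}
b = {"blocks.0.attn.qkv.weight": np.zeros((3, 3)), "head.weight": np.zeros((2, 3))}
merged = average_checkpoints([a, b])   # every merged entry equals 0.5
```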
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit [6.187270874122921]
We propose a toolkit for systematic multimodal VAE training and comparison.
We present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities.
arXiv Detail & Related papers (2022-09-07T10:26:28Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
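The shared-representation, per-task-head pattern summarized above can be sketched as follows. The layer sizes, module names, and task set are hypothetical placeholders rather than MulT's actual configuration; each head here is a small task-specific transformer block plus a linear projection.

```python
import torch
import torch.nn as nn

class SharedEncoderMultitask(nn.Module):
    """One shared transformer backbone, one small task-specific head per task
    (a generic sketch of the shared-representation multitask pattern)."""
    def __init__(self, dim=256, tasks=("segmentation", "depth")):
        super().__init__()
        self.embed = nn.Linear(3 * 16 * 16, dim)          # toy patch embedding
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.heads = nn.ModuleDict({
            t: nn.Sequential(
                nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
                nn.Linear(dim, 1),                         # per-token prediction
            ) for t in tasks
        })

    def forward(self, patches):                            # (B, tokens, 768)
        shared = self.encoder(self.embed(patches))         # shared representation
        return {t: head(shared) for t, head in self.heads.items()}

model = SharedEncoderMultitask()
out = model(torch.randn(2, 196, 3 * 16 * 16))
# out["segmentation"].shape == out["depth"].shape == (2, 196, 1)
```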
- Multi-query Video Retrieval [44.32936301162444]
We focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.
We propose several new methods for leveraging multiple queries at training time to improve over simply combining similarity outputs of multiple queries.
We believe further modeling efforts will bring new insights in this direction and spark new systems that perform better in real-world video retrieval applications.
arXiv Detail & Related papers (2022-01-10T20:44:46Z)
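"Simply combining similarity outputs of multiple queries", which the paper above improves on, amounts to a late-fusion baseline like the one below. This is the baseline, not the proposed training-time methods; the similarity matrix is a random stand-in for the output of any text-video model.

```python
import numpy as np

def combine_query_scores(sim_matrix, how="mean"):
    """Late-fusion baseline for multi-query retrieval.
    sim_matrix: (q, n) similarities of q queries against n videos.
    Returns one (n,) score per video and the resulting ranking."""
    if how == "mean":
        scores = sim_matrix.mean(axis=0)
    elif how == "max":
        scores = sim_matrix.max(axis=0)
    else:
        raise ValueError(how)
    return scores, np.argsort(-scores)

# Hypothetical usage: 3 textual queries describing the same target video,
# scored against 100 candidate videos.
sims = np.random.randn(3, 100)
scores, ranking = combine_query_scores(sims, how="mean")
```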
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)
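The "instance-level contrastive learning with a multimodal clustering step" described above combines two standard ingredients. The sketch below pairs a symmetric InfoNCE loss between two modalities with nearest-centroid pseudo-labels as a stand-in for the clustering step; the temperature, dimensions, and clustering choice are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(video_emb, text_emb, temperature=0.07):
    """Instance-level contrastive loss: matching (video_i, text_i) pairs are
    pulled together, all other pairs in the batch are pushed apart."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def cluster_pseudo_labels(embeddings, centroids):
    """Stand-in for the multimodal clustering step: assign each embedding to
    its nearest centroid so semantically similar samples share a pseudo-label."""
    e = F.normalize(embeddings, dim=-1)
    c = F.normalize(centroids, dim=-1)
    return (e @ c.T).argmax(dim=-1)                 # (B,) cluster ids

video = torch.randn(8, 256)
text = torch.randn(8, 256)
loss = symmetric_infonce(video, text)
labels = cluster_pseudo_labels(torch.cat([video, text]), torch.randn(32, 256))
```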
This list is automatically generated from the titles and abstracts of the papers on this site.