RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video
Retrieval
- URL: http://arxiv.org/abs/2206.12845v1
- Date: Sun, 26 Jun 2022 11:12:49 GMT
- Title: RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video
Retrieval
- Authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
- Abstract summary: We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
- Score: 66.2075707179047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vast numbers of videos are uploaded daily as social channels grow in
popularity; thus, retrieving the video content most relevant to a user's textual
query plays an increasingly crucial role. Most methods learn only a single joint
embedding space between global visual and textual features, ignoring the local
structure of each modality. Other approaches build separate embedding spaces for
global and local features, ignoring rich inter-modality correlations.
We propose RoME, a novel mixture-of-expert transformer that disentangles the
text and the video into three levels: the roles of spatial contexts, temporal
contexts, and object contexts. We utilize a transformer-based attention
mechanism to fully exploit visual and text embeddings at both global and local
levels, with a mixture of experts that captures correlations across modalities
and across structural levels. The results indicate that our method outperforms
state-of-the-art methods on the YouCook2 and MSR-VTT datasets, given the same
visual backbone and no pre-training. Finally, we conduct extensive ablation
studies to elucidate our design choices.
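As a rough, hypothetical illustration of the idea described above (not the authors' released code), the following PyTorch sketch assumes three expert branches for spatial, temporal, and object contexts whose pooled outputs are combined by learned gating weights; all module names, dimensions, and the gating scheme are illustrative assumptions.

# Hypothetical sketch of role-aware mixture-of-experts fusion, loosely following
# the abstract: three expert branches (spatial, temporal, object contexts) whose
# outputs are mixed by learned gating weights. Names and sizes are assumptions.
import torch
import torch.nn as nn


class RoleExpert(nn.Module):
    """One expert: a small transformer encoder over token-level features."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> pooled expert embedding (batch, dim)
        return self.encoder(tokens).mean(dim=1)


class RoleAwareMoE(nn.Module):
    """Mixes spatial-, temporal-, and object-context experts for one modality."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.experts = nn.ModuleList([RoleExpert(dim) for _ in range(3)])
        self.gate = nn.Linear(dim, 3)  # per-sample mixture weights over the 3 experts

    def forward(self, spatial, temporal, objects, global_feat):
        # Each local input: (batch, seq_len, dim); global_feat: (batch, dim)
        expert_out = torch.stack(
            [e(x) for e, x in zip(self.experts, (spatial, temporal, objects))], dim=1
        )  # (batch, 3, dim)
        weights = torch.softmax(self.gate(global_feat), dim=-1)       # (batch, 3)
        local = (weights.unsqueeze(-1) * expert_out).sum(dim=1)       # (batch, dim)
        return nn.functional.normalize(global_feat + local, dim=-1)


# Usage: embed video and text with separate heads, then train with a contrastive
# loss over the cosine similarity of the two embeddings.
video_head, text_head = RoleAwareMoE(), RoleAwareMoE()
v = video_head(torch.randn(4, 16, 512), torch.randn(4, 16, 512),
               torch.randn(4, 16, 512), torch.randn(4, 512))
t = text_head(torch.randn(4, 20, 512), torch.randn(4, 20, 512),
              torch.randn(4, 20, 512), torch.randn(4, 512))
similarity = v @ t.t()  # (4, 4) video-text similarity matrix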
Related papers
- Semantics-Consistent Cross-domain Summarization via Optimal Transport
Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluated our method on three recent multimodal datasets and demonstrated its effectiveness in producing high-quality multimodal summaries.
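As a hedged illustration of what optimal-transport alignment between visual and textual segments can look like (a simplification, not this paper's actual formulation), the sketch below runs Sinkhorn iterations on a cosine-distance cost matrix; the epsilon, iteration count, and uniform marginals are assumed values.

# Illustrative sketch only: entropic optimal transport via Sinkhorn iterations
# to align visual segment features with textual segment features.
import torch


def sinkhorn_alignment(video_seg, text_seg, eps=0.05, iters=50):
    """Soft transport plan aligning video segments with text segments.

    video_seg: (n, d) visual segment features
    text_seg:  (m, d) textual segment features
    """
    vid = torch.nn.functional.normalize(video_seg, dim=-1)
    txt = torch.nn.functional.normalize(text_seg, dim=-1)
    cost = 1.0 - vid @ txt.t()                             # (n, m) cosine-distance cost
    K = torch.exp(-cost / eps)                             # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))    # uniform row marginal
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))    # uniform column marginal
    u, w = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                                 # Sinkhorn fixed-point updates
        w = b / (K.t() @ u)
        u = a / (K @ w)
    plan = u.unsqueeze(1) * K * w.unsqueeze(0)             # (n, m) transport plan
    return plan, (plan * cost).sum()                       # plan and alignment cost


plan, ot_cost = sinkhorn_alignment(torch.randn(8, 256), torch.randn(5, 256))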
arXiv Detail & Related papers (2022-10-10T14:27:10Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z) - Semantic Role Aware Correlation Transformer for Text to Video Retrieval [23.183653281610866]
This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts.
Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses a current state-of-the-art method by a clear margin in all metrics.
arXiv Detail & Related papers (2022-06-26T11:28:03Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, the EndoVis18 Challenge and the CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z)
- T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval [59.990432265734384]
Text-video retrieval is a challenging task that aims to search relevant video contents based on natural language descriptions.
Most existing methods only consider the global cross-modal similarity and overlook the local details.
In this paper, we design an efficient global-local alignment method.
We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
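A minimal sketch of one common way to combine global and local similarity in the spirit of this entry, assuming shared soft-assignment clusters for the local part; the cluster count and fusion weight are assumptions, and this is not the paper's implementation.

# Rough sketch: global cosine similarity plus a local similarity computed over
# per-center aggregates of softly clustered local features (simplified).
import torch
import torch.nn.functional as F


def global_local_similarity(video_global, text_global,
                            video_local, text_local,
                            centers, alpha=0.5):
    """video_local: (n, d), text_local: (m, d), centers: (k, d) shared clusters."""
    # Global similarity: cosine between pooled video and text embeddings.
    sim_g = F.cosine_similarity(video_global, text_global, dim=0)

    def aggregate(feats):
        assign = F.softmax(feats @ centers.t(), dim=-1)  # (len, k) soft assignment
        pooled = assign.t() @ feats                      # (k, d) per-center aggregate
        return F.normalize(pooled, dim=-1)

    # Local similarity: compare per-center aggregates across the two modalities.
    sim_l = (aggregate(video_local) * aggregate(text_local)).sum(dim=-1).mean()
    return alpha * sim_g + (1 - alpha) * sim_l


score = global_local_similarity(torch.randn(512), torch.randn(512),
                                torch.randn(30, 512), torch.randn(12, 512),
                                centers=torch.randn(8, 512))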
arXiv Detail & Related papers (2021-04-20T15:26:24Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
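A hypothetical sketch of the general recipe (not the authors' model): build one mid-level feature per query phrase by attending over frame features, fuse the phrase-level features, and regress a normalized (start, end) interval; dimensions and pooling choices are assumptions.

# Illustrative sketch: phrase-conditioned attention over frames, then interval regression.
import torch
import torch.nn as nn


class IntervalRegressor(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.regressor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, phrase_feats, frame_feats):
        # phrase_feats: (batch, num_phrases, dim); frame_feats: (batch, num_frames, dim)
        mid, _ = self.attn(phrase_feats, frame_feats, frame_feats)  # phrase-conditioned video evidence
        pooled = mid.mean(dim=1)                                    # fuse phrase-level features
        start, length = torch.sigmoid(self.regressor(pooled)).unbind(dim=-1)
        return start, torch.clamp(start + length, max=1.0)          # normalized (start, end)


model = IntervalRegressor()
s, e = model(torch.randn(2, 4, 512), torch.randn(2, 64, 512))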
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.