Modality-Balanced Embedding for Video Retrieval
- URL: http://arxiv.org/abs/2204.08182v1
- Date: Mon, 18 Apr 2022 06:29:46 GMT
- Title: Modality-Balanced Embedding for Video Retrieval
- Authors: Xun Wang, Bingqing Ke, Xuanping Li, Fangyu Liu, Mingyu Zhang, Xiao
Liang, Qiushi Xiao, Yue Yu
- Abstract summary: We identify a modality bias phenomenon in which the video encoder relies almost entirely on text matching.
We propose MBVR (short for Modality Balanced Video Retrieval) with two key components.
We show empirically that our method is both effective and efficient in solving the modality bias problem.
- Score: 21.81705847039759
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video search has become the main routine for users to discover videos
relevant to a text query on large short-video sharing platforms. While
training a query-video bi-encoder model using online search logs, we identify a
modality bias phenomenon in which the video encoder relies almost entirely on
text matching, neglecting other modalities of the videos such as vision and
audio. This modality imbalance results from a) the modality gap: the relevance
between a query and a video's text is much easier to learn, since the query is
itself a piece of text with the same modality as the video text; and b) data
bias: most training samples can be solved by text matching alone. Here we share
our practices for improving the first retrieval stage, including our solution
to the modality imbalance issue. We propose MBVR (short for Modality Balanced
Video Retrieval) with two key components: manually generated modality-shuffled
(MS) samples and a dynamic margin (DM) based on visual relevance. Together they
encourage the video encoder to pay balanced attention to each modality. Through
extensive experiments on a real-world dataset, we show empirically that our
method is both effective and efficient in solving the modality bias problem. We
have also deployed MBVR on a large video platform and observed a statistically
significant boost over a highly optimized baseline in an A/B test and manual
GSB evaluations.
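As a rough illustration of the two components, here is a minimal PyTorch-style sketch that builds modality-shuffled negatives by pairing each video's text embedding with another video's visual embedding, and applies a hinge loss whose margin grows with a visual-relevance score. The fusion callable, the exact loss form, and the margin schedule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def modality_shuffled_negatives(text_emb, vis_emb, fuse):
    """Pair each video's text embedding with another video's visual embedding.

    The fused result keeps the matching text but mismatched visual content,
    so a modality-balanced encoder should score it below the true positive.
    `fuse` is any callable that merges the two modality embeddings.
    """
    perm = torch.randperm(vis_emb.size(0))
    return fuse(text_emb, vis_emb[perm])

def dynamic_margin_loss(query_emb, pos_emb, ms_emb, visual_rel,
                        base_margin=0.2, scale=0.1):
    """Hinge loss with a margin that grows with visual relevance (assumed form).

    Queries whose positives are judged more visually relevant must separate
    the true video from its modality-shuffled counterpart by a larger margin.
    """
    margin = base_margin + scale * visual_rel          # per-sample margin
    sim_pos = F.cosine_similarity(query_emb, pos_emb)  # query vs. true video
    sim_ms = F.cosine_similarity(query_emb, ms_emb)    # query vs. MS negative
    return F.relu(margin + sim_ms - sim_pos).mean()
```

In practice such a term would be added to the usual query-video contrastive objective, so that a video matched on text alone can no longer satisfy the loss.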
Related papers
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
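A minimal sketch of what CLIP-score-guided frame sampling could look like, assuming per-frame CLIP similarities to the textual input are precomputed; the function name and parameters are illustrative and not VaQuitA's actual pipeline:

```python
import torch

def clip_guided_frame_sampling(frames, clip_scores, k=8):
    """Keep the k frames with the highest CLIP similarity, in temporal order.

    `frames` is a list of frames and `clip_scores` a 1-D tensor of
    precomputed CLIP image-text similarities (one score per frame).
    """
    k = min(k, len(frames))
    top = torch.topk(clip_scores, k=k).indices
    keep = torch.sort(top).values  # restore temporal order
    return [frames[i] for i in keep.tolist()]
```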
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on evaluation on an existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with that of baselines adopting cross-modal interaction learning.
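The efficiency comes from the bi-encoder design: video embeddings can be computed offline, so query-time retrieval reduces to a similarity lookup over a precomputed index. A minimal sketch under that assumption (names and the cosine scoring are illustrative):

```python
import torch

def retrieve_top_k(query_emb, video_index, k=10):
    """Score one query against a precomputed video embedding index.

    `video_index` is an (N, D) tensor of offline-encoded video embeddings;
    `query_emb` is a (D,) tensor from the text encoder. Cosine similarity
    reduces to a single matrix-vector product after L2 normalization.
    """
    q = torch.nn.functional.normalize(query_emb, dim=0)
    v = torch.nn.functional.normalize(video_index, dim=1)
    scores = v @ q                      # (N,) similarity scores
    return torch.topk(scores, k=min(k, v.size(0)))
```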
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)