Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video
Moment Retrieval
- URL: http://arxiv.org/abs/2312.12155v1
- Date: Tue, 19 Dec 2023 13:38:48 GMT
- Title: Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video
Moment Retrieval
- Authors: Zhihang Liu, Jun Li, Hongtao Xie, Pandeng Li, Jiannan Ge, Sun-Ao Liu,
Guoqing Jin
- Abstract summary: Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query.
Existing strategies are often sub-optimal since they ignore the modality imbalance problem.
We introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment.
- Score: 31.42856682276394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed
videos corresponding to a given language query by constructing cross-modal
alignment strategies. However, these existing strategies are often sub-optimal
since they ignore the modality imbalance problem, i.e., the semantic
richness inherent in videos far exceeds that of a given limited-length
sentence. Therefore, in pursuit of better alignment, a natural idea is
enhancing the video modality to filter out query-irrelevant semantics, and
enhancing the text modality to capture more segment-relevant knowledge. In this
paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework
for more balanced alignment through enhancing features at two levels. First, we
enhance the video modality at the frame-word level through word reconstruction.
This strategy emphasizes the portions associated with query words in
frame-level features while suppressing irrelevant parts. Therefore, the
enhanced video contains less redundant semantics and is more balanced with the
textual modality. Second, we enhance the textual modality at the
segment-sentence level by learning complementary knowledge from context
sentences and ground-truth segments. With the knowledge added to the query, the
textual modality thus maintains more meaningful semantics and is more balanced
with the video modality. By implementing two levels of MESM, the semantic
information from both modalities is more balanced to align, thereby bridging
the modality gap. Experiments on three widely used benchmarks, including the
out-of-distribution settings, show that the proposed framework achieves a new
state-of-the-art performance with notable generalization ability (e.g., 4.42%
and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code
will be available at https://github.com/lntzm/MESM.
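To make the frame-word level enhancement more concrete, below is a minimal, hypothetical PyTorch sketch of the idea as described in the abstract: frame features are re-weighted by their similarity to query words, and a word-reconstruction head is trained to recover the query words from word-attended frame features. All module names, dimensions, and the exact loss are illustrative assumptions, not the authors' released implementation (see the repository linked above for that); the segment-sentence level enhancement, which adds segment-derived knowledge to the query, is not covered here.

```python
# Hypothetical sketch of frame-word level modal enhancement (not the official MESM code).
# Frame features are re-weighted by word-frame similarity, and a word-reconstruction
# head encourages the enhanced frames to retain query-relevant semantics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameWordEnhancer(nn.Module):
    def __init__(self, dim: int = 256, vocab_size: int = 10000):
        super().__init__()
        self.frame_proj = nn.Linear(dim, dim)
        self.word_proj = nn.Linear(dim, dim)
        # Predicts query-word ids from word-attended frame features (reconstruction head).
        self.recon_head = nn.Linear(dim, vocab_size)

    def forward(self, frames, words, word_ids):
        # frames:   (B, T, D) frame-level video features
        # words:    (B, L, D) word-level query features
        # word_ids: (B, L)    vocabulary indices of the query words
        f = self.frame_proj(frames)                                    # (B, T, D)
        w = self.word_proj(words)                                      # (B, L, D)

        # Frame-word similarity; softmax over frames highlights query-relevant frames.
        sim = torch.einsum("btd,bld->btl", f, w) / f.size(-1) ** 0.5   # (B, T, L)
        frame_weight = sim.softmax(dim=1)                              # per-word attention over frames

        # Enhanced (query-conditioned) video: suppress frames no query word attends to.
        relevance = frame_weight.sum(dim=2, keepdim=True)              # (B, T, 1)
        enhanced_frames = frames * torch.sigmoid(relevance)

        # Word reconstruction: pool frames per word and predict the word id,
        # forcing the enhanced video to keep word-level semantics.
        word_ctx = torch.einsum("btl,btd->bld", frame_weight, frames)  # (B, L, D)
        logits = self.recon_head(word_ctx)                             # (B, L, V)
        recon_loss = F.cross_entropy(logits.flatten(0, 1), word_ids.flatten())
        return enhanced_frames, recon_loss

# Toy usage with random tensors.
if __name__ == "__main__":
    B, T, L, D = 2, 32, 8, 256
    model = FrameWordEnhancer(dim=D)
    frames, words = torch.randn(B, T, D), torch.randn(B, L, D)
    word_ids = torch.randint(0, 10000, (B, L))
    enhanced, loss = model(frames, words, word_ids)
    print(enhanced.shape, loss.item())
```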
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-stage working strategy: language feature distribution, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z)
- Semantic Role Aware Correlation Transformer for Text to Video Retrieval [23.183653281610866]
This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts.
Preliminary results on the popular YouCook2 benchmark indicate that our approach surpasses a current state-of-the-art method by a large margin in all metrics.
arXiv Detail & Related papers (2022-06-26T11:28:03Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.