Boosting Video-Text Retrieval with Explicit High-Level Semantics
- URL: http://arxiv.org/abs/2208.04215v2
- Date: Tue, 9 Aug 2022 03:52:28 GMT
- Title: Boosting Video-Text Retrieval with Explicit High-Level Semantics
- Authors: Haoran Wang, Di Xu, Dongliang He, Fu Li, Zhong Ji, Jungong Han, Errui
Ding
- Abstract summary: We propose a novel visual-linguistic alignment model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
- Score: 115.66219386097295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-text retrieval (VTR) is an attractive yet challenging task for
multi-modal understanding, which aims to retrieve the relevant video (text)
given a text (video) query. Existing methods typically employ completely
heterogeneous visual-textual information to align video and text, while lacking
awareness of the homogeneous high-level semantic information residing in both
modalities. To fill this gap, we propose a novel visual-linguistic alignment
model named HiSE for VTR, which improves the cross-modal representation by
incorporating explicit high-level semantics.
First, we explore the hierarchical property of explicit high-level semantics,
and further decompose it into two levels, i.e. discrete semantics and holistic
semantics. Specifically, for the visual branch, we exploit an off-the-shelf
semantic entity predictor to generate discrete high-level semantics. In
parallel, a trained video captioning model is employed to output holistic
high-level semantics. As for the textual modality, we parse the text into three
parts: occurrence, action and entity. In particular, the occurrence
corresponds to the holistic high-level semantics, while action and entity
represent the discrete ones. Then, different graph reasoning techniques
are utilized to promote the interaction between holistic and discrete
high-level semantics. Extensive experiments demonstrate that, with the aid of
explicit high-level semantics, our method achieves superior performance over
state-of-the-art methods on three benchmark datasets, including MSR-VTT,
MSVD and DiDeMo.
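To make the two-level design concrete, the following is a minimal, hypothetical Python sketch of how holistic and discrete semantic similarities could be fused at matching time. It is not the authors' implementation: the upstream modules (semantic entity predictor, video captioning model, text parser) are assumed to already produce embeddings, and HiSE's graph reasoning is replaced by simple pooling purely for illustration.

```python
# Hypothetical sketch only -- NOT the HiSE implementation. It assumes the
# upstream modules (entity predictor, video captioner, text parser) have
# already been run and yield fixed-size embeddings; HiSE's graph reasoning
# over discrete semantics is approximated here by max/mean pooling.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hise_style_score(
    video_holistic: np.ndarray,        # embedding of the generated video caption
    video_entities: list[np.ndarray],  # embeddings from the semantic entity predictor
    text_occurrence: np.ndarray,       # "occurrence" (holistic) embedding of the query
    text_discrete: list[np.ndarray],   # action + entity embeddings parsed from the query
    alpha: float = 0.5,                # weight of the holistic term (assumed value)
) -> float:
    """Fuse holistic and discrete high-level semantic similarities."""
    holistic_sim = cosine(video_holistic, text_occurrence)
    # Discrete level: match each textual action/entity to its best visual
    # entity, then average the matches.
    discrete_sim = float(np.mean([
        max(cosine(t, v) for v in video_entities) for t in text_discrete
    ]))
    return alpha * holistic_sim + (1.0 - alpha) * discrete_sim

# Toy usage: rank two candidate videos for one text query.
rng = np.random.default_rng(0)
query_occ = rng.normal(size=256)
query_parts = [rng.normal(size=256) for _ in range(3)]
videos = [(rng.normal(size=256), [rng.normal(size=256) for _ in range(5)])
          for _ in range(2)]
scores = [hise_style_score(h, ents, query_occ, query_parts) for h, ents in videos]
print("ranking:", np.argsort(scores)[::-1])
```

At retrieval time such a score would be computed between the query and every candidate in the gallery and used to rank them; in the paper the interaction between the holistic and discrete levels is learned with graph reasoning rather than the fixed pooling used above.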
Related papers
- Unifying Latent and Lexicon Representations for Effective Video-Text
Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z) - Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video
Moment Retrieval [31.42856682276394]
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query.
Existing strategies are often sub-optimal since they ignore the modality imbalance problem.
We introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment.
arXiv Detail & Related papers (2023-12-19T13:38:48Z) - GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding [101.32590239809113]
Generalized Perception NeRF (GP-NeRF) is a novel pipeline that makes widely used segmentation models and NeRF work compatibly under a unified framework.
We propose two self-distillation mechanisms, i.e., the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality of the semantic field.
arXiv Detail & Related papers (2023-11-20T15:59:41Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - A semantically enhanced dual encoder for aspect sentiment triplet
extraction [0.7291396653006809]
Aspect sentiment triplet extraction (ASTE) is a crucial subtask of aspect-based sentiment analysis (ABSA).
Previous research has focused on enhancing ASTE through innovative table-filling strategies.
We propose a framework that leverages both a basic encoder, primarily based on BERT, and a particular encoder comprising a Bi-LSTM network and a graph convolutional network (GCN).
Experiments conducted on benchmark datasets demonstrate the state-of-the-art performance of our proposed framework.
arXiv Detail & Related papers (2023-06-14T09:04:14Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z) - Semantic Role Aware Correlation Transformer for Text to Video Retrieval [23.183653281610866]
This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts.
Preliminary results on the popular YouCook2 benchmark indicate that our approach surpasses a current state-of-the-art method by a large margin in all metrics.
arXiv Detail & Related papers (2022-06-26T11:28:03Z) - TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic
Segmentation [44.75300205362518]
Unsupervised semantic segmentation aims to obtain high-level semantic representation on low-level visual features without manual annotations.
We propose the first top-down unsupervised semantic segmentation framework for fine-grained segmentation in extremely complicated scenarios.
Our results show that our top-down unsupervised segmentation is robust to both object-centric and scene-centric datasets.
arXiv Detail & Related papers (2021-12-02T18:59:03Z) - Hierarchical Modular Network for Video Captioning [162.70349114104107]
We propose a hierarchical modular network to bridge video representations and linguistic semantics from three levels before generating captions.
The proposed method performs favorably against state-of-the-art models on two widely used benchmarks, with CIDEr scores of 104.0% on MSVD and 51.5% on MSR-VTT.
arXiv Detail & Related papers (2021-11-24T13:07:05Z)