Exploiting Semantic Role Contextualized Video Features for
Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022
- URL: http://arxiv.org/abs/2206.14381v2
- Date: Tue, 26 Sep 2023 14:27:30 GMT
- Title: Exploiting Semantic Role Contextualized Video Features for
Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022
- Authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
- Abstract summary: We present our approach for the EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
We first parse sentences into semantic roles corresponding to verbs and nouns, then utilize self-attention to exploit semantic role contextualized video features.
- Score: 72.12974259966592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we present our approach for the EPIC-KITCHENS-100
Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic
roles corresponding to verbs and nouns, then utilize self-attention to exploit
semantic role contextualized video features along with textual features via
triplet losses in multiple embedding spaces. Our method surpasses the strong
baseline in normalized Discounted Cumulative Gain (nDCG), which better reflects
semantic similarity. Our submission is ranked 3rd for nDCG and
ranked 4th for mAP.
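As a rough illustration of the pipeline described above, the sketch below contextualizes video segment features with one self-attention block per semantic role (verb and noun), projects video and text into separate per-role embedding spaces, and applies an in-batch hinge-style triplet loss in each space. This is not the authors' released code; the module names, feature dimensions, mean-pooling, and the simplified loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoleContextualizedRetrieval(nn.Module):
    """Sketch: per-role self-attention over video features + per-role embedding spaces."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=256, n_heads=4):
        super().__init__()
        roles = ("verb", "noun")
        # One self-attention block per semantic role to contextualize the video segments.
        self.role_attn = nn.ModuleDict(
            {r: nn.MultiheadAttention(video_dim, n_heads, batch_first=True) for r in roles})
        # Separate verb / noun embedding spaces for video and text.
        self.video_proj = nn.ModuleDict({r: nn.Linear(video_dim, embed_dim) for r in roles})
        self.text_proj = nn.ModuleDict({r: nn.Linear(text_dim, embed_dim) for r in roles})

    def forward(self, video_feats, role_text_feats):
        # video_feats: (B, T, video_dim) segment features
        # role_text_feats: {"verb": (B, text_dim), "noun": (B, text_dim)} pooled features
        #                  of the words filling each parsed semantic role
        out = {}
        for role, txt in role_text_feats.items():
            ctx, _ = self.role_attn[role](video_feats, video_feats, video_feats)
            v = F.normalize(self.video_proj[role](ctx.mean(dim=1)), dim=-1)
            t = F.normalize(self.text_proj[role](txt), dim=-1)
            out[role] = (v, t)
        return out

def multi_space_triplet_loss(embeddings, margin=0.2):
    # Simplified in-batch hinge loss per embedding space: matching (video, text)
    # pairs are positives, every other pair in the batch is a negative.
    loss = 0.0
    for v, t in embeddings.values():
        sim = v @ t.T                              # (B, B) cosine similarities
        pos = sim.diag().unsqueeze(1)              # similarity of matching pairs
        mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        loss = loss + F.relu(margin + sim - pos)[mask].mean()
    return loss
```

A symmetric video-to-text term and hard-negative mining, which are typical companions of triplet losses, are omitted for brevity.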
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Grounded Video Situation Recognition [37.279915290069326]
We present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions.
Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly.
arXiv Detail & Related papers (2022-10-19T18:38:10Z)
- Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval [48.20798265640068]
We introduce additional phrase-level supervision for the better identification of mismatched units in the text.
We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels.
For the training, we propose multi-scale matching losses from both global and local perspectives.
arXiv Detail & Related papers (2021-09-12T14:21:15Z)
- Video-aided Unsupervised Grammar Induction [108.53765268059425]
We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video.
Video provides even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases.
We propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities.
arXiv Detail & Related papers (2021-04-09T14:01:36Z)
- Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model.
Second, we propose a hybrid pointer network approach to explicitly interpolate the probabilities of words occurring in the metadata (see the sketch after this list).
arXiv Detail & Related papers (2020-05-15T07:47:33Z)
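As referenced in the last entry above, the hybrid pointer idea interpolates the language model's word distribution with a copy distribution over words that occur in the video metadata, weighted by a learned gate. The sketch below is a generic illustration of that interpolation step, not the paper's implementation; the gate, tensor shapes, and names are assumptions.

```python
import torch
import torch.nn as nn

class HybridPointerMixture(nn.Module):
    """Sketch: mix an LM's vocabulary distribution with a copy distribution over metadata words."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)  # produces the copy probability p_copy

    def forward(self, lm_probs, hidden, metadata_attn, metadata_token_ids):
        # lm_probs: (B, V) softmax output of the neural language model
        # hidden: (B, H) decoder state used to predict the copy gate
        # metadata_attn: (B, M) attention weights over metadata tokens (sum to 1 per row)
        # metadata_token_ids: (B, M) vocabulary ids of those metadata tokens
        p_copy = torch.sigmoid(self.gate(hidden))              # (B, 1)
        copy_probs = torch.zeros_like(lm_probs)
        # Scatter the attention mass onto the vocabulary positions of metadata words.
        copy_probs.scatter_add_(1, metadata_token_ids, metadata_attn)
        return (1 - p_copy) * lm_probs + p_copy * copy_probs   # (B, V) mixed distribution
```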
This list is automatically generated from the titles and abstracts of the papers on this site.