VLG-Net: Video-Language Graph Matching Network for Video Grounding
- URL: http://arxiv.org/abs/2011.10132v2
- Date: Mon, 16 Aug 2021 14:53:59 GMT
- Title: VLG-Net: Video-Language Graph Matching Network for Video Grounding
- Authors: Mattia Soldan, Mengmeng Xu, Sisi Qu, Jesper Tegner, Bernard Ghanem
- Abstract summary: Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
- Score: 57.6661145190528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding language queries in videos aims at identifying the time interval
(or moment) semantically relevant to a language query. The solution to this
challenging task demands understanding videos' and queries' semantic content
and the fine-grained reasoning about their multi-modal interactions. Our key
idea is to recast this challenge into an algorithmic graph matching problem.
Fueled by recent advances in Graph Neural Networks, we propose to leverage
Graph Convolutional Networks to model video and textual information as well as
their semantic alignment. To enable the mutual exchange of information across
the modalities, we design a novel Video-Language Graph Matching Network
(VLG-Net) to match video and query graphs. Core ingredients include
representation graphs built atop video snippets and query tokens separately and
used to model intra-modality relationships. A Graph Matching layer is adopted
for cross-modal context modeling and multi-modal fusion. Finally, moment
candidates are created using masked moment attention pooling by fusing the
moment's enriched snippet features. We demonstrate superior performance over
state-of-the-art grounding methods on three widely used datasets for temporal
localization of moments in videos with language queries: ActivityNet-Captions,
TACoS, and DiDeMo.
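The abstract outlines a three-stage architecture: intra-modality graph convolutions over snippet and token graphs, a Graph Matching layer for cross-modal fusion, and masked moment attention pooling over candidate moments. Below is a minimal sketch of that flow in PyTorch; the module names, the dot-product matching, the mean-pooled moments, and all dimensions are illustrative assumptions, not the authors' implementation.

# Minimal, hypothetical sketch of a VLG-Net-style pipeline in PyTorch.
# Layer choices (single GCN hop, dot-product matching, mean-pooled moments)
# are illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """One graph-convolution hop: aggregate neighbors with a row-normalized adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):                        # x: (N, D), adj: (N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        return F.relu(self.proj((adj / deg) @ x))

class GraphMatching(nn.Module):
    """Cross-modal fusion: each video node attends over the query nodes."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video, query):                  # video: (Nv, D), query: (Nq, D)
        attn = torch.softmax(video @ query.t() / video.size(-1) ** 0.5, dim=-1)
        matched = attn @ query                        # (Nv, D) query context per snippet
        return F.relu(self.fuse(torch.cat([video, matched], dim=-1)))

def masked_moment_pooling(snippets, start, end):
    """Represent a candidate moment by pooling the snippets it spans (mean here)."""
    mask = torch.zeros(snippets.size(0), dtype=torch.bool)
    mask[start:end + 1] = True
    return snippets[mask].mean(dim=0)                 # (D,) moment representation

if __name__ == "__main__":
    D, Nv, Nq = 256, 32, 12                           # feature dim, #snippets, #tokens
    video = torch.randn(Nv, D)                        # snippet features
    query = torch.randn(Nq, D)                        # token embeddings
    chain = torch.diag(torch.ones(Nv - 1), 1)         # edges between adjacent snippets
    video_adj = torch.eye(Nv) + chain + chain.t()     # symmetric temporal chain + self-loops
    query_adj = torch.ones(Nq, Nq)                    # fully connected token graph (sketch only)

    gcn_v, gcn_q, match = GraphConv(D), GraphConv(D), GraphMatching(D)
    video = gcn_v(video, video_adj)                   # intra-modality video reasoning
    query = gcn_q(query, query_adj)                   # intra-modality query reasoning
    fused = match(video, query)                       # cross-modal graph matching

    moment = masked_moment_pooling(fused, start=4, end=11)
    score = nn.Linear(D, 1)(moment)                   # relevance score for this candidate
    print(score.shape)                                # torch.Size([1])

In the paper a pool of candidate moments is scored rather than a single hand-picked span; the single candidate here only keeps the sketch short.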
Related papers
- RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z)
- GraphextQA: A Benchmark for Evaluating Graph-Enhanced Large Language Models [33.56759621666477]
We present a benchmark dataset for evaluating the integration of graph knowledge into language models.
The proposed dataset is designed to evaluate graph-language models' ability to understand graphs and make use of them for answer generation.
We perform experiments with language-only models and the proposed graph-language model to validate the usefulness of the paired graphs and to demonstrate the difficulty of the task.
arXiv Detail & Related papers (2023-10-12T16:46:58Z)
- Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos [0.40778318140713216]
This study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
arXiv Detail & Related papers (2022-09-13T00:01:23Z)
- Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Relation-aware Video Reading Comprehension for Temporal Language Grounding [67.5613853693704]
Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence.
This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it.
arXiv Detail & Related papers (2021-10-12T03:10:21Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph (a generic sketch of joint-graph message passing appears after this list).
Our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
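The CSMGAN entry above recasts localization as iterative message passing over a joint video-query graph. The sketch below is a generic, hypothetical illustration of that idea: nodes from both modalities share one similarity-based adjacency that is refined over a few passes. The adjacency construction, residual update, and iteration count are assumptions for illustration, not the paper's design.

# Generic sketch of iterative message passing over a joint video-query graph.
# The similarity-based adjacency and the iteration count are illustrative
# assumptions, not the CSMGAN architecture itself.
import torch
import torch.nn.functional as F

def joint_message_passing(video, query, iterations=3):
    """video: (Nv, D) snippet features, query: (Nq, D) token features."""
    nodes = torch.cat([video, query], dim=0)          # joint graph nodes (Nv+Nq, D)
    for _ in range(iterations):
        # Pairwise-similarity edges cover both self- and cross-modal links.
        adj = torch.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)
        nodes = F.relu(adj @ nodes) + nodes           # aggregate neighbors + residual
    return nodes[: video.size(0)], nodes[video.size(0):]

if __name__ == "__main__":
    v, q = joint_message_passing(torch.randn(32, 128), torch.randn(10, 128))
    print(v.shape, q.shape)                           # torch.Size([32, 128]) torch.Size([10, 128])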
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.