Jointly Cross- and Self-Modal Graph Attention Network for Query-Based
Moment Localization
- URL: http://arxiv.org/abs/2008.01403v2
- Date: Thu, 13 Aug 2020 01:56:06 GMT
- Title: Jointly Cross- and Self-Modal Graph Attention Network for Query-Based
Moment Localization
- Authors: Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou,
Zichuan Xu
- Abstract summary: We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.
Our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization.
- Score: 77.21951145754065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Query-based moment localization is a new task that localizes the
best-matched segment in an untrimmed video according to a given sentence query.
In this localization task, it is crucial to thoroughly mine both visual and
linguistic information. To this end, we propose a novel Cross- and Self-Modal
Graph Attention Network (CSMGAN) that recasts this task as a process of
iterative message passing over a joint graph. Specifically, the joint graph
consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation
Graph (SMG), where frames and words are represented as nodes, and the relations
between cross- and self-modal node pairs are described by an attention
mechanism. Through parametric message passing, CMG highlights relevant
instances across the video and sentence, and SMG then models the pairwise
relations inside each modality to correlate frames (words). With multiple
layers of such a joint graph, our CSMGAN is able to effectively capture
high-order interactions between the two modalities, thus enabling more precise
localization. Besides, to better comprehend the contextual details in the
query, we develop a hierarchical sentence encoder to enhance query
understanding. Extensive experiments on four public datasets demonstrate the
effectiveness of our proposed model, and CSMGAN significantly outperforms the
state of the art.
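The abstract describes attention-based message passing between frame nodes and word nodes on a joint graph. A minimal sketch of one such cross-modal and self-modal round is shown below; this is an illustrative simplification with randomly initialized weights, not the paper's actual CSMGAN implementation, and all function and variable names here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_message_passing(nodes_a, nodes_b, w_q, w_k, w_v):
    """One round of attention: every node in nodes_a aggregates
    messages from all nodes in nodes_b, weighted by scaled dot-product
    attention. Cross-modal when a != b, self-modal when a is b."""
    q = nodes_a @ w_q                               # (Na, d) queries
    k = nodes_b @ w_k                               # (Nb, d) keys
    v = nodes_b @ w_v                               # (Nb, d) values (messages)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Na, Nb) edge weights
    return attn @ v                                 # aggregated messages

rng = np.random.default_rng(0)
d = 16
frames = rng.normal(size=(20, d))   # video frame nodes
words = rng.normal(size=(8, d))     # query word nodes
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Cross-modal step (CMG-style): each modality attends to the other
frames_upd = attention_message_passing(frames, words, w_q, w_k, w_v)
words_upd = attention_message_passing(words, frames, w_q, w_k, w_v)

# Self-modal step (SMG-style): each modality attends within itself
frames_self = attention_message_passing(frames_upd, frames_upd, w_q, w_k, w_v)
```

Stacking several such cross- then self-modal rounds is what lets higher layers capture higher-order interactions between the two modalities.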
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing an incremental structure expanding (ISE) approach.
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
- Graph Ordering Attention Networks [22.468776559433614]
Graph Neural Networks (GNNs) have been successfully used in many problems involving graph-structured data.
We introduce the Graph Ordering Attention (GOAT) layer, a novel GNN component that captures interactions between nodes in a neighborhood.
The GOAT layer demonstrates improved performance in modeling graph metrics that capture complex information.
arXiv Detail & Related papers (2022-04-11T18:13:19Z)
- DigNet: Digging Clues from Local-Global Interactive Graph for Aspect-level Sentiment Classification [0.685316573653194]
In aspect-level sentiment classification (ASC), state-of-the-art models encode either syntax graph or relation graph.
We design a novel local-global interactive graph, which marries their advantages by stitching the two graphs via interactive edges.
In this paper, we propose a novel neural network termed DigNet, whose core module is the stacked local-global interactive layers.
arXiv Detail & Related papers (2022-01-04T05:34:02Z)
- r-GAT: Relational Graph Attention Network for Multi-Relational Graphs [8.529080554172692]
Graph Attention Network (GAT) focuses on modelling simple undirected and single relational graph data only.
We propose r-GAT, a relational graph attention network to learn multi-channel entity representations.
Experiments on link prediction and entity classification tasks show that our r-GAT can model multi-relational graphs effectively.
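The relation-aware attention idea behind papers like r-GAT can be illustrated with a generic sketch of a graph attention layer that uses a separate projection and scorer per relation type. This is an illustrative simplification under assumed shapes, not the paper's actual r-GAT architecture, and every name below is hypothetical:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def relational_gat_layer(h, edges, w_rel, attn_vec):
    """One relation-aware attention layer.

    h:        (N, d) node features
    edges:    list of (src, dst, rel) directed edges
    w_rel:    dict rel -> (d, d) relation-specific projection
    attn_vec: dict rel -> (2d,) relation-specific attention scorer

    Each destination node aggregates incoming messages, softmax-normalized
    over its in-edges (unnormalized scores accumulated, then divided out).
    """
    out = np.zeros_like(h)
    norm = np.zeros((h.shape[0], 1))
    for src, dst, rel in edges:
        msg = h[src] @ w_rel[rel]                       # relation-projected message
        pair = np.concatenate([h[dst] @ w_rel[rel], msg])
        score = np.exp(leaky_relu(pair @ attn_vec[rel]))  # GAT-style edge score
        out[dst] += score * msg
        norm[dst] += score
    return out / np.maximum(norm, 1e-9)  # nodes with no in-edges stay zero

rng = np.random.default_rng(1)
h = rng.normal(size=(3, 4))
edges = [(0, 1, "r1"), (2, 1, "r1"), (1, 0, "r2")]
w_rel = {"r1": rng.normal(size=(4, 4)), "r2": rng.normal(size=(4, 4))}
attn_vec = {"r1": rng.normal(size=8), "r2": rng.normal(size=8)}
h_next = relational_gat_layer(h, edges, w_rel, attn_vec)
```

Keeping a distinct projection per relation is what allows different edge types to carry different "channels" of information, in contrast to a vanilla GAT that treats all edges identically.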
arXiv Detail & Related papers (2021-09-13T12:43:00Z)
- Multi Scale Temporal Graph Networks For Skeleton-based Action Recognition [5.970574258839858]
Graph convolutional networks (GCNs) can effectively capture the features of related nodes and improve the performance of the model.
Existing methods based on GCNs have two problems. First, the consistency of temporal and spatial features is ignored because features are extracted node by node and frame by frame.
We propose a novel model called Temporal Graph Networks (TGN) for action recognition.
arXiv Detail & Related papers (2020-12-05T08:08:25Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- Understanding Dynamic Scenes using Graph Convolution Networks [22.022759283770377]
We present a novel framework to model on-road vehicle behaviors from a sequence of temporally ordered frames as grabbed by a moving camera.
We show a seamless transfer of learning to multiple datasets without resorting to fine-tuning.
Such behavior prediction methods find immediate relevance in a variety of navigation tasks.
arXiv Detail & Related papers (2020-05-09T13:05:06Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.