Target Adaptive Context Aggregation for Video Scene Graph Generation
- URL: http://arxiv.org/abs/2108.08121v1
- Date: Wed, 18 Aug 2021 12:46:28 GMT
- Title: Target Adaptive Context Aggregation for Video Scene Graph Generation
- Authors: Yao Teng, Limin Wang, Zhifeng Li, Gangshan Wu
- Abstract summary: This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
- Score: 36.669700084337045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper deals with the challenging task of video scene graph generation
(VidSGG), which could serve as a structured video representation for high-level
understanding tasks. We present a new detect-to-track paradigm for this
task by decoupling the context modeling for relation prediction from the
complicated low-level entity tracking. Specifically, we design an efficient
method for frame-level VidSGG, termed the Target Adaptive Context
Aggregation Network (TRACE), with a focus on capturing spatio-temporal context
information for relation recognition. Our TRACE framework streamlines the
VidSGG pipeline with a modular design, and presents two unique blocks of
Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context
Aggregation. More specifically, our HRTree first provides an adaptive structure for
organizing possible relation candidates efficiently and guides the context
aggregation module to effectively capture spatio-temporal structure
information. Then, we obtain a contextualized feature representation for each
relation candidate and build a classification head to recognize its relation
category. Finally, we provide a simple temporal association strategy that links
the TRACE-detected frame-level results into video-level scene graphs. We perform experiments
on two VidSGG benchmarks, ImageNet-VidVRD and Action Genome, and the results
demonstrate that TRACE achieves state-of-the-art performance. The code
and models are available at https://github.com/MCG-NJU/TRACE.
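
To make the detect-to-track paradigm above more concrete, the following Python sketch shows one possible shape of such a pipeline: a frame-level relation detector (a stand-in for TRACE, whose internals are not described in this abstract) followed by a simple greedy temporal association step that links per-frame relation triplets into video-level relations. The detector interface, the IoU-based matching rule, and the thresholds are illustrative assumptions rather than the authors' actual implementation; the official code at https://github.com/MCG-NJU/TRACE is the authoritative reference.

    # Illustrative sketch only: a generic detect-to-track VidSGG pipeline.
    # The detector callable, the greedy IoU matching rule, and the thresholds
    # are assumptions for exposition; they are not taken from the TRACE paper.
    from dataclasses import dataclass, field
    from typing import Callable, List, Tuple

    Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

    @dataclass
    class FrameRelation:
        subject_box: Box
        object_box: Box
        predicate: str          # e.g. "ride", "next_to"
        score: float
        frame_idx: int

    @dataclass
    class VideoRelation:
        predicate: str
        track: List[FrameRelation] = field(default_factory=list)

    def iou(a: Box, b: Box) -> float:
        """Intersection-over-union of two axis-aligned boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        def area(r):
            return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    def associate(video_rels: List[VideoRelation],
                  frame_rels: List[FrameRelation],
                  iou_thr: float = 0.5) -> None:
        """Greedily extend existing video-level relations with one frame's detections."""
        unmatched = list(frame_rels)
        for vr in video_rels:
            last = vr.track[-1]
            best, best_sim = None, iou_thr
            for fr in unmatched:
                if fr.predicate != vr.predicate:
                    continue
                sim = min(iou(last.subject_box, fr.subject_box),
                          iou(last.object_box, fr.object_box))
                if sim > best_sim:
                    best, best_sim = fr, sim
            if best is not None:
                vr.track.append(best)
                unmatched.remove(best)
        # Detections that match no existing track start new video-level relations.
        video_rels.extend(VideoRelation(predicate=fr.predicate, track=[fr])
                          for fr in unmatched)

    def detect_to_track(frames,
                        detect: Callable[[object, int], List[FrameRelation]]
                        ) -> List[VideoRelation]:
        """detect(frame, frame_idx) plays the role of the frame-level model."""
        video_rels: List[VideoRelation] = []
        for idx, frame in enumerate(frames):
            associate(video_rels, detect(frame, idx))
        return video_rels

In this reading, "simple temporal association" is interpreted as greedy same-predicate matching on subject/object box overlap between consecutive frames; the paper's actual strategy may differ in its matching criterion and handling of missed detections.
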
Related papers
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Relation Regularized Scene Graph Generation [206.76762860019065]
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations.
We propose a relation regularized network (R2-Net) which can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
arXiv Detail & Related papers (2022-02-22T11:36:49Z)
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
- Structured Sparse R-CNN for Direct Scene Graph Generation [16.646937866282922]
This paper presents a simple, sparse, and unified framework for relation detection, termed Structured Sparse R-CNN.
The key to our method is a set of learnable triplet queries and structured triplet detectors which could be optimized jointly from the training set in an end-to-end manner.
We perform experiments on two benchmarks, Visual Genome and Open Images, and the results demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-06-21T02:24:20Z)
- Structured Co-reference Graph Attention for Video-grounded Dialogue [17.797726722637634]
The Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video.
Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on two benchmarks.
arXiv Detail & Related papers (2021-03-24T17:36:33Z)
- Bidirectional Graph Reasoning Network for Panoptic Segmentation [126.06251745669107]
We introduce a Bidirectional Graph Reasoning Network (BGRNet) to mine the intra-modular and inter-modular relations within and between foreground things and background stuff classes.
BGRNet first constructs image-specific graphs in both instance and semantic segmentation branches that enable flexible reasoning at the proposal level and class level.
arXiv Detail & Related papers (2020-04-14T02:32:10Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph that efficiently represents frames as nodes and relations between arbitrary frame pairs as edges (a minimal construction of such a frame graph is sketched after this list).
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
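
As a companion to the AGNN entry above: its summary describes representing a video as a fully connected graph with one node per frame and one edge per frame pair. The short Python sketch below builds only that graph structure from per-frame features; it does not implement AGNN's attention-based message passing, and the function name and edge representation are placeholders chosen for illustration.

    # Illustration of the fully connected frame graph described in the AGNN
    # summary: one node per frame, one undirected edge per frame pair.
    # This does NOT implement AGNN's attentive message passing.
    from itertools import combinations
    from typing import Dict, List, Tuple

    def build_frame_graph(frame_features: List) -> Tuple[Dict, Dict]:
        """Return (nodes, edges) for a complete graph over the given frames."""
        nodes = {i: feat for i, feat in enumerate(frame_features)}
        # Every unordered pair of distinct frames is connected, so a clip with
        # N frames yields N * (N - 1) / 2 edges.
        edges = {(i, j): (nodes[i], nodes[j])
                 for i, j in combinations(range(len(frame_features)), 2)}
        return nodes, edges
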