Visual Relationship Forecasting in Videos
- URL: http://arxiv.org/abs/2107.01181v1
- Date: Fri, 2 Jul 2021 16:43:19 GMT
- Title: Visual Relationship Forecasting in Videos
- Authors: Li Mi, Yangjun Ou, Zhenzhong Chen
- Abstract summary: We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a reasoning manner.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of spatio-temporally localized visual relation annotations in a video.
- Score: 56.122037294234865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world scenarios often require the anticipation of object interactions in an
unknown future, which would assist the decision-making process of both humans
and agents. To meet this challenge, we present a new task named Visual
Relationship Forecasting (VRF) in videos to explore the prediction of visual
relationships in a reasoning manner. Specifically, given a subject-object pair
with H existing frames, VRF aims to predict their future interactions for the
next T frames without visual evidence. To evaluate the VRF task, we introduce
two video datasets named VRF-AG and VRF-VidOR, with a series of
spatio-temporally localized visual relation annotations in a video. These two
datasets densely annotate 13 and 35 visual relationships in 1923 and 13447
video clips, respectively. In addition, we present a novel Graph Convolutional
Transformer (GCT) framework, which captures both object-level and frame-level
dependencies via a spatio-temporal Graph Convolutional Network and a Transformer.
Experimental results on both VRF-AG and VRF-VidOR datasets demonstrate that GCT
outperforms state-of-the-art sequence modelling methods on visual
relationship forecasting.
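The abstract describes GCT only at a high level, so the following is a minimal PyTorch sketch of the stated two-stage idea: a per-frame graph convolution over object features for object-level dependencies, a Transformer encoder over the H observed frames for frame-level dependencies, and a head that emits predicate logits for the next T frames. The module layout, dimensions, pooling, and decoding scheme are all assumptions, not the authors' implementation; only the predicate count of 13 (VRF-AG) comes from the abstract.

```python
import torch
import torch.nn as nn

class GCTSketch(nn.Module):
    """Hypothetical Graph Convolutional Transformer layout for VRF:
    a per-frame GCN step captures object-level dependencies, a
    Transformer encoder captures frame-level dependencies, and a
    linear head scores predicates for each of the next T frames."""

    def __init__(self, feat_dim=512, num_predicates=13, t_future=3, heads=8):
        super().__init__()
        self.gcn = nn.Linear(feat_dim, feat_dim)  # shared GCN weight matrix
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, t_future * num_predicates)
        self.t_future, self.num_predicates = t_future, num_predicates

    def forward(self, node_feats, adj):
        # node_feats: (B, H, N, D) object features over H observed frames
        # adj:        (B, H, N, N) per-frame subject-object adjacency
        x = torch.relu(adj @ self.gcn(node_feats))  # spatial graph convolution
        x = x.mean(dim=2)                           # pool objects -> (B, H, D)
        x = self.temporal(x)                        # frame-level dependencies
        logits = self.head(x[:, -1])                # decode from the last state
        return logits.view(-1, self.t_future, self.num_predicates)

# Toy usage: 2 clips, H=8 observed frames, N=2 objects (subject, object).
model = GCTSketch()
out = model(torch.randn(2, 8, 2, 512), torch.ones(2, 8, 2, 2))
print(out.shape)  # torch.Size([2, 3, 13]) -> T=3 frames x 13 predicates
```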
Related papers
- CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos [9.807247838436489]
We introduce the new AeroEye dataset that focuses on multi-object relationship modeling in aerial videos.
We propose the novel Cyclic Graph Transformer (CYCLO) approach that allows the model to capture both direct and long-range temporal dependencies.
The proposed approach also allows one to handle sequences with inherent cyclical patterns and process object relationships in the correct sequential order.
arXiv Detail & Related papers (2024-06-03T06:24:55Z)
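The CYCLO summary mentions handling inherently cyclical patterns with a cyclic graph; one plausible reading, sketched below, links consecutive frames in a ring so that temporal message passing can wrap around from the last frame to the first. The adjacency construction and the averaging step are assumptions for illustration, not the paper's actual formulation.

```python
import torch

def cyclic_adjacency(num_frames: int) -> torch.Tensor:
    """Ring adjacency over frames: i <-> i+1 plus a wrap-around edge,
    so propagation can follow cyclical patterns (hypothetical reading)."""
    adj = torch.zeros(num_frames, num_frames)
    idx = torch.arange(num_frames)
    adj[idx, (idx + 1) % num_frames] = 1.0  # forward edges, wrapping around
    adj[(idx + 1) % num_frames, idx] = 1.0  # backward edges
    return adj

# One propagation step over per-frame relationship embeddings.
frames = torch.randn(8, 256)                # 8 frames, 256-d embeddings
adj = cyclic_adjacency(8)
frames = adj @ frames / adj.sum(dim=1, keepdim=True)  # average ring neighbours
print(frames.shape)                         # torch.Size([8, 256])
```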
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- Collaboratively Self-supervised Video Representation Learning for Action Recognition [58.195372471117615]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z)
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
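To make the two-module split above concrete, here is a minimal sketch in which a Transformer encoder reasons over the objects within each frame and a Transformer decoder attends over the resulting per-frame representations. Layer counts, dimensions, and the learned-query decoding are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical spatial encoder: self-attention over the N objects of a frame.
spatial = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=1)

# Hypothetical temporal decoder: attends over per-frame representations.
temporal = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=1)

B, H, N, D = 2, 10, 5, 256                 # clips, frames, objects, dims
objects = torch.randn(B * H, N, D)         # fold frames into the batch
frame_repr = spatial(objects).mean(dim=1)  # spatial context per frame
frame_repr = frame_repr.view(B, H, D)      # restore the time axis
queries = torch.randn(B, H, D)             # stand-in for learned queries
print(temporal(queries, frame_repr).shape) # torch.Size([2, 10, 256])
```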
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
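As a rough illustration of message passing over relational and temporal structure, the hypothetical step below uses separate weights for same-frame (spatial) and cross-frame (temporal) edges with a gated node update. The edge typing and update rule are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class RelationalTemporalMP(nn.Module):
    """One hypothetical message-passing step with separate transforms
    for same-frame (spatial) and cross-frame (temporal) edges."""

    def __init__(self, dim=128):
        super().__init__()
        self.w_spatial = nn.Linear(dim, dim)
        self.w_temporal = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)  # gated node-state update

    def forward(self, nodes, adj_spatial, adj_temporal):
        # nodes: (N, D); adjacencies: (N, N), row-normalised
        msg = (adj_spatial @ self.w_spatial(nodes)
               + adj_temporal @ self.w_temporal(nodes))
        return self.update(msg, nodes)

# 6 entity nodes spanning two frames, with placeholder edge structure.
mp = RelationalTemporalMP()
nodes = torch.randn(6, 128)
adj_s = torch.eye(6)                         # placeholder same-frame edges
adj_t = torch.roll(torch.eye(6), 3, dims=0)  # placeholder cross-frame edges
print(mp(nodes, adj_s, adj_t).shape)         # torch.Size([6, 128])
```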
- Understanding Dynamic Scenes using Graph Convolution Networks [22.022759283770377]
We present a novel framework to model on-road vehicle behaviors from a sequence of temporally ordered frames captured by a moving camera.
We show a seamless transfer of learning to multiple datasets without resorting to fine-tuning.
Such behavior prediction methods find immediate relevance in a variety of navigation tasks.
arXiv Detail & Related papers (2020-05-09T13:05:06Z)
- Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR).
Compared with sketch-based still-image retrieval and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be matched simultaneously at a fine-grained level.
We show that the proposed model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
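The fully connected frame graph that AGNN is said to build can be illustrated with a small attention-style step: frames are nodes, pairwise affinities act as edge weights, and each node aggregates the embeddings of all other frames. The scoring function and residual update below are assumptions, not AGNN's actual message passing.

```python
import torch

def attentive_frame_graph(frames: torch.Tensor) -> torch.Tensor:
    """Hypothetical AGNN-style step over a fully connected frame graph:
    edge weights from pairwise attention, then weighted aggregation."""
    scores = frames @ frames.t() / frames.shape[1] ** 0.5  # pairwise affinity
    scores.fill_diagonal_(float('-inf'))                   # drop self-edges
    weights = torch.softmax(scores, dim=1)                 # per-node attention
    return frames + weights @ frames                       # residual aggregation

frames = torch.randn(5, 64)                  # 5 frames as 64-d node embeddings
print(attentive_frame_graph(frames).shape)   # torch.Size([5, 64])
```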
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.