Compositional Video Synthesis with Action Graphs
- URL: http://arxiv.org/abs/2006.15327v4
- Date: Thu, 10 Jun 2021 21:07:15 GMT
- Title: Compositional Video Synthesis with Action Graphs
- Authors: Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik,
Trevor Darrell, Amir Globerson
- Abstract summary: Videos of actions are complex signals containing rich compositional structure in space and time.
We propose to represent the actions in a graph structure called an Action Graph and present the new ``Action Graph To Video'' synthesis task.
Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation.
- Score: 112.94651460161992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos of actions are complex signals containing rich compositional structure
in space and time. Current video generation methods lack the ability to
condition the generation on multiple coordinated and potentially simultaneous
timed actions. To address this challenge, we propose to represent the actions
in a graph structure called Action Graph and present the new ``Action Graph To
Video'' synthesis task. Our generative model for this task (AG2Vid)
disentangles motion and appearance features and, by incorporating a scheduling
mechanism for actions, facilitates timely and coordinated video generation. We
train and evaluate AG2Vid on the CATER and Something-Something V2 datasets, and
show that the resulting videos have better visual quality and semantic
consistency compared to baselines. Finally, our model demonstrates zero-shot
abilities by synthesizing novel compositions of the learned actions. For code
and pretrained models, see the project page https://roeiherz.github.io/AG2Video
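The abstract sketches the key input structure: an Action Graph whose nodes are objects and whose edges are actions annotated with start and end times, which the scheduling mechanism uses to coordinate generation. As a rough, hypothetical illustration of what such an input could look like (the class names, fields, and clock-based lookup below are assumptions, not the authors' actual interface), consider:
```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch of an "Action Graph" input: object nodes plus timed action
# edges, and a toy clock-based lookup standing in for the scheduling idea.
# All class and field names here are illustrative assumptions, not AG2Vid code.

@dataclass
class ActionEdge:
    subject: str            # acting object, e.g. "sphere"
    action: str             # action label, e.g. "slide", "rotate", "pick-place"
    target: Optional[str]   # optional second object involved in the action
    start: int              # first frame in which the action is active
    end: int                # last frame in which the action is active

@dataclass
class ActionGraph:
    objects: List[str] = field(default_factory=list)         # nodes
    actions: List[ActionEdge] = field(default_factory=list)  # timed edges

    def active_at(self, t: int) -> List[ActionEdge]:
        """Actions whose time span covers frame t (toy 'scheduling')."""
        return [a for a in self.actions if a.start <= t <= a.end]

# Two coordinated, partially overlapping actions on a CATER-like scene.
graph = ActionGraph(
    objects=["cone", "sphere", "cube"],
    actions=[
        ActionEdge("sphere", "slide", None, start=0, end=15),
        ActionEdge("cone", "pick-place", "cube", start=10, end=30),
    ],
)
print([a.action for a in graph.active_at(12)])  # -> ['slide', 'pick-place']
```
Querying the graph at a frame index returns all actions active at that time, which is the kind of coordination over simultaneous, timed actions the abstract describes.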
Related papers
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, and its sparsity lets the model train without hitting memory and compute bottlenecks.
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Pose-guided Generative Adversarial Net for Novel View Action Synthesis [6.019777076722422]
Given an action video, the goal is to generate the same action from an unseen viewpoint.
We propose a novel framework named Pose-guided Action Separable Generative Adversarial Net (PAS-GAN)
We employ a novel local-global spatial transformation module to effectively generate sequential video features in the target view.
arXiv Detail & Related papers (2021-10-15T10:33:09Z)
- Sketch Me A Video [32.38205496481408]
We introduce a new video synthesis task that takes only two rough, badly drawn sketches as input to create a realistic portrait video.
A two-stage Sketch-to-Video model is proposed, which consists of two key novelties.
arXiv Detail & Related papers (2021-10-10T05:40:11Z)
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
- Temporal Relational Modeling with Self-Supervision for Action Segmentation [38.62057004624234]
We introduce Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in video.
In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
arXiv Detail & Related papers (2020-12-14T13:41:28Z)
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes and relations between arbitrary frame pairs as edges (a rough sketch of such a frame graph appears after this list).
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
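Several of the papers above (VideoSAGE, AGNN) build graphs whose nodes are video frames and whose edges connect frame pairs. As a loose, hypothetical sketch of one attention-weighted message-passing step over a fully connected frame graph (the feature dimensions and dot-product attention are assumptions for illustration, not either paper's actual architecture), consider:
```python
import numpy as np

# Rough sketch of a fully connected frame graph: every frame is a node, every
# frame pair gets an edge, and one round of attention-weighted message passing
# mixes information across frames. Purely illustrative, not the papers' models.

def frame_graph_message_passing(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (T, D) per-frame features; returns updated (T, D) features."""
    T, D = frame_feats.shape
    # Edge weights between all frame pairs (fully connected graph).
    scores = frame_feats @ frame_feats.T / np.sqrt(D)            # (T, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                # softmax per node
    # Each node aggregates messages from all nodes, including itself.
    messages = weights @ frame_feats                             # (T, D)
    return frame_feats + messages                                # residual update

feats = np.random.randn(8, 16).astype(np.float32)   # 8 frames, 16-dim features
print(frame_graph_message_passing(feats).shape)      # (8, 16)
```
A sparse variant (as in VideoSAGE) would simply zero out most entries of the pairwise weight matrix before aggregation, trading coverage of frame pairs for memory and compute.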