Patch-based Object-centric Transformers for Efficient Video Generation
- URL: http://arxiv.org/abs/2206.04003v1
- Date: Wed, 8 Jun 2022 16:29:59 GMT
- Title: Patch-based Object-centric Transformers for Efficient Video Generation
- Authors: Wilson Yan, Ryo Okumura, Stephen James, Pieter Abbeel
- Abstract summary: We present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture.
We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos.
Due to the better compressibility of object-centric representations, we can improve training efficiency by allowing the model to access only object information over longer temporal horizons.
- Score: 71.55412580325743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present Patch-based Object-centric Video Transformer (POVT),
a novel region-based video generation architecture that leverages
object-centric information to efficiently model temporal dynamics in videos. We
build upon prior work in video prediction via an autoregressive transformer
over the discrete latent space of compressed videos, with an added modification
to model object-centric information via bounding boxes. Due to the better
compressibility of object-centric representations, we can improve training
efficiency by allowing the model to access only object information over longer
temporal horizons. When evaluated on various difficult
object-centric datasets, our method achieves better or equal performance to
other video generation models, while remaining computationally more efficient
and scalable. In addition, we show that our method is able to perform
object-centric controllability through bounding box manipulation, which may aid
downstream tasks such as video editing, or visual planning. Samples are
available at
https://sites.google.com/view/povt-public
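The efficiency claim in the abstract (dense patch tokens attend only over a short temporal window, while compact bounding-box object tokens remain visible over the full horizon) can be illustrated with a small attention-mask sketch. This is a minimal sketch, not the authors' implementation; the per-frame [object | patch] token layout, the token counts, the window size, and the frame-level (rather than token-level) causality are all illustrative assumptions.

```python
import torch

def povt_style_mask(num_frames: int, obj_tokens: int, patch_tokens: int,
                    patch_window: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over a flat sequence of
    num_frames * (obj_tokens + patch_tokens) tokens, laid out frame by frame
    with the object tokens first in each frame."""
    per_frame = obj_tokens + patch_tokens
    total = num_frames * per_frame
    idx = torch.arange(total)
    frame = idx // per_frame                    # frame index of each token
    is_patch = (idx % per_frame) >= obj_tokens  # object tokens come first per frame

    q_frame, k_frame = frame[:, None], frame[None, :]
    causal = k_frame <= q_frame                  # frame-level causality (simplified)
    recent = (q_frame - k_frame) < patch_window  # short window for dense patch keys
    # Patch keys are visible only inside the recent window; object keys remain
    # visible over the entire past, giving long-horizon context cheaply.
    return causal & (recent | ~is_patch[None, :])

# Example: 16 frames, 4 object tokens and 64 patch tokens per frame,
# with patches attended over the last 2 frames only.
mask = povt_style_mask(num_frames=16, obj_tokens=4, patch_tokens=64, patch_window=2)
print(mask.shape)  # torch.Size([1088, 1088])
```

Because the number of object tokens per frame is small, extending the visible horizon for object keys adds far fewer attended positions than extending it for patch keys, which is where the claimed training-efficiency gain comes from.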
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention [29.62044843067169]
Video object segmentation is a fundamental research problem in computer vision.
We propose a new method for self-supervised video object segmentation based on distillation learning of deformable attention.
arXiv Detail & Related papers (2024-01-25T04:39:48Z)
- VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce a framework that is object-centric and is designed both to control the object's appearance and, notably, to execute precise and explicit structural modifications on the object.
We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control.
We evaluate our method on the image-driven video editing task, showing performance similar to the state of the art and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z)
- Video based Object 6D Pose Estimation using Transformers [6.951360830202521]
VideoPose is an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate 6D object poses in videos.
Our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences.
Our approach is on par with state-of-the-art Transformer methods and performs significantly better than CNN-based approaches.
arXiv Detail & Related papers (2022-10-24T18:45:53Z)
- Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)
- Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.