Transframer: Arbitrary Frame Prediction with Generative Models
- URL: http://arxiv.org/abs/2203.09494v2
- Date: Fri, 18 Mar 2022 10:34:43 GMT
- Title: Transframer: Arbitrary Frame Prediction with Generative Models
- Authors: Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, Peter Battaglia
- Abstract summary: We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction.
We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames.
- Score: 21.322137081404904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a general-purpose framework for image modelling and vision tasks
based on probabilistic frame prediction. Our approach unifies a broad range of
tasks, from image segmentation, to novel view synthesis and video
interpolation. We pair this framework with an architecture we term Transframer,
which uses U-Net and Transformer components to condition on annotated context
frames, and outputs sequences of sparse, compressed image features. Transframer
is the state-of-the-art on a variety of video generation benchmarks, is
competitive with the strongest models on few-shot view synthesis, and can
generate coherent 30-second videos from a single image without any explicit
geometric information. A single generalist Transframer simultaneously produces
promising results on 8 tasks, including semantic segmentation, image
classification and optical flow prediction with no task-specific architectural
components, demonstrating that multi-task computer vision can be tackled using
probabilistic image models. Our approach can in principle be applied to a wide
range of applications that require learning the conditional structure of
annotated image-formatted data.
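To make the conditioning pipeline concrete, below is a minimal sketch of a Transframer-style model, assuming a PyTorch U-Net context encoder and an autoregressive Transformer decoder over discrete tokens of a compressed target frame; all module names, sizes, and the token vocabulary (TinyUNet, FramePredictor, vocab=1024, etc.) are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch (illustrative only): U-Net-encoded context frames act as
# Transformer memory; the decoder autoregressively predicts discrete tokens
# of a compressed target frame.
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Toy U-Net encoder: one down/up stage with a skip connection."""

    def __init__(self, in_ch: int, dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, dim, 3, padding=1), nn.GELU())
        self.down = nn.Conv2d(dim, dim, 4, stride=2, padding=1)
        self.up = nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1)
        self.out = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)                               # (B, dim, H, W)
        h = self.up(self.down(skip))                     # back to (B, dim, H, W)
        return self.out(torch.cat([h, skip], dim=1))


class FramePredictor(nn.Module):
    """Condition on context frames, decode target-frame tokens autoregressively."""

    def __init__(self, in_ch=3, dim=128, vocab=1024, max_len=256):
        super().__init__()
        self.unet = TinyUNet(in_ch, dim)
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, context_frames, target_tokens):
        # context_frames: (B, N, C, H, W); target_tokens: (B, T) discrete codes
        b, n, c, h, w = context_frames.shape
        feats = self.unet(context_frames.flatten(0, 1))   # (B*N, dim, H, W)
        memory = feats.flatten(2).transpose(1, 2)         # (B*N, H*W, dim)
        memory = memory.reshape(b, n * h * w, -1)         # one memory per example
        t = target_tokens.shape[1]
        pos = torch.arange(t, device=target_tokens.device)
        tgt = self.tok_emb(target_tokens) + self.pos_emb(pos)
        causal = torch.triu(                              # causal mask for decoding
            torch.full((t, t), float("-inf"), device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                             # (B, T, vocab) logits


# Illustrative call: two examples, three 32x32 context frames, 64 target tokens.
logits = FramePredictor()(torch.randn(2, 3, 3, 32, 32),
                          torch.randint(0, 1024, (2, 64)))
```
At training time the logits would be scored with cross-entropy against the target-frame tokens, and at inference the decoder would sample tokens autoregressively before decompressing them back into an image. Annotation conditioning (e.g., timestamps or camera parameters) is omitted from this sketch; in the full framework such metadata would also be embedded and fed to the model.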
Related papers
- TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z)
- ComFe: Interpretable Image Classifiers With Foundation Models, Transformers and Component Features [0.0]
Component Features (ComFe) is a novel interpretable-by-design image classification approach.
It is highly scalable and can obtain better accuracy and robustness in comparison to non-interpretable methods.
arXiv Detail & Related papers (2024-03-07T00:44:21Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation from image generation methods, and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto standard of Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame"
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention which are designed to get contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state of the art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.