CCVS: Context-aware Controllable Video Synthesis
- URL: http://arxiv.org/abs/2107.08037v1
- Date: Fri, 16 Jul 2021 17:57:44 GMT
- Title: CCVS: Context-aware Controllable Video Synthesis
- Authors: Guillaume Le Moing and Jean Ponce and Cordelia Schmid
- Abstract summary: This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
- Score: 95.22008742695772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This presentation introduces a self-supervised learning approach to the
synthesis of new video clips from old ones, with several new key elements for
improved spatial resolution and realism: It conditions the synthesis process on
contextual information for temporal continuity and ancillary information for
fine control. The prediction model is doubly autoregressive, in the latent
space of an autoencoder for forecasting, and in image space for updating
contextual information, which is also used to enforce spatio-temporal
consistency through a learnable optical flow module. Adversarial training of
the autoencoder in the appearance and temporal domains is used to further
improve the realism of its output. A quantizer inserted between the encoder and
the transformer in charge of forecasting future frames in latent space (and its
inverse inserted between the transformer and the decoder) adds even more
flexibility by affording simple mechanisms for handling multimodal ancillary
information for controlling the synthesis process (e.g., a few sample frames, an
audio track, a trajectory in image space) and taking into account the
intrinsically uncertain nature of the future by allowing multiple predictions.
Experiments with an implementation of the proposed approach give very good
qualitative and quantitative results on multiple tasks and standard benchmarks.
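As a rough illustration of the forecasting pipeline described above, here is a minimal PyTorch-style sketch of one step of the doubly-autoregressive loop: context frames are encoded and snapped to a learned codebook, a causal transformer predicts the next frame's latent tokens, and the decoded frame is appended to the image-space context. All module names, sizes, and the greedy decoding step are illustrative assumptions rather than the authors' implementation; the learnable optical flow module, adversarial training, and ancillary control tokens are omitted.

```python
# Minimal sketch (assumed sizes and names, not the authors' code) of one step of
# the doubly-autoregressive loop: encode/quantize context frames, forecast the
# next frame's latent tokens with a causal transformer, decode, and append the
# result to the image-space context. Flow warping and adversarial losses omitted.
import torch
import torch.nn as nn

class ToyFrameAutoencoder(nn.Module):
    """Maps 64x64 RGB frames to a 4x4 grid of latent vectors and back (toy sizes)."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=4))            # 64x64 -> 4x4 latents
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=4))      # 4x4 latents -> 64x64 frame

class NearestCodebookQuantizer(nn.Module):
    """Snaps each latent vector to its nearest codebook entry and returns token ids."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
    def forward(self, z):                                 # z: (B, dim, H, W)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        return ids.view(z.shape[0], -1)                   # (B, H*W) discrete tokens

def forecast_next_frame(frames, autoenc, quantizer, transformer, token_head):
    """One forecasting step. A faithful implementation would sample future tokens
    one at a time (and could draw several samples to model uncertainty); here the
    whole next 4x4 token grid is predicted in one shot for brevity."""
    B = frames[0].shape[0]
    tokens = torch.cat([quantizer(autoenc.enc(f)) for f in frames], dim=1)
    emb = quantizer.codebook(tokens)                      # reuse codebook as embedding
    causal = torch.triu(torch.full((emb.shape[1],) * 2, float("-inf")), diagonal=1)
    hidden = transformer(emb, mask=causal)                # attend only to past tokens
    logits = token_head(hidden[:, -16:])                  # predict next frame's 16 tokens
    next_tokens = logits.argmax(dim=-1)                   # greedy decoding (illustrative)
    z_next = quantizer.codebook(next_tokens).view(B, 4, 4, -1).permute(0, 3, 1, 2)
    return frames + [autoenc.dec(z_next)]                 # updated image-space context

# Toy usage with random frames.
autoenc = ToyFrameAutoencoder()
quantizer = NearestCodebookQuantizer()
transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
token_head = nn.Linear(64, 512)
context = [torch.randn(2, 3, 64, 64) for _ in range(3)]   # three conditioning frames
context = forecast_next_frame(context, autoenc, quantizer, transformer, token_head)
print(len(context), context[-1].shape)                    # 4 frames; last is (2, 3, 64, 64)
```

Sampling from the logits instead of taking the argmax is one simple way to obtain the multiple plausible futures mentioned in the abstract, and appending the decoded frame back into the context list is the image-space half of the double autoregression.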
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
- Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning [20.655943795843037]
We introduce an innovative approach that focuses on aligning and binding time series representations encoded from different modalities.
In contrast to conventional methods that fuse features from multiple modalities, our proposed approach simplifies the neural architecture by retaining a single time series encoder.
Our approach outperforms existing state-of-the-art unsupervised representation learning (URL) methods across diverse downstream tasks.
arXiv Detail & Related papers (2023-12-09T22:31:20Z)
- Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
- Pair-wise Layer Attention with Spatial Masking for Video Prediction [46.17429511620538]
We develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps.
We also present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for Translator prediction.
arXiv Detail & Related papers (2023-11-19T10:29:05Z)
- DynPoint: Dynamic Neural Point For View Synthesis [45.44096876841621]
We propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos.
DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation.
Our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.
arXiv Detail & Related papers (2023-10-29T12:55:53Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.