Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer
- URL: http://arxiv.org/abs/2305.05464v1
- Date: Tue, 9 May 2023 14:03:27 GMT
- Title: Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer
- Authors: Nisha Huang, Yuxin Zhang, Weiming Dong
- Abstract summary: This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer with an image latent diffusion model to achieve concise text-controlled video stylization.
Tests show superior content preservation and stylistic performance at lower computational cost than previous solutions.
- Score: 13.098901971644656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale text-to-video diffusion models have demonstrated an exceptional
ability to synthesize diverse videos. However, due to the lack of extensive
text-to-video datasets and the necessary computational resources for training,
directly applying these models for video stylization remains difficult. Also,
given that the noise addition process on the input content is random and
destructive, fulfilling the style transfer task's content preservation criteria
is challenging. This paper proposes a zero-shot video stylization method named
Style-A-Video, which utilizes a generative pre-trained transformer with an
image latent diffusion model to achieve a concise text-controlled video
stylization. We improve the guidance condition in the denoising process,
establishing a balance between artistic expression and structure preservation.
Furthermore, to decrease inter-frame flicker and avoid the formation of
additional artifacts, we employ a sampling optimization and a temporal
consistency module. Extensive experiments show that we attain superior content
preservation and stylistic performance while incurring lower computational cost
than previous solutions. Code will be available at
https://github.com/haha-lisa/Style-A-Video.
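To make the recipe described in the abstract concrete, below is a minimal, hedged sketch of zero-shot per-frame text-guided stylization built on an off-the-shelf Stable Diffusion img2img pipeline from the diffusers library. It illustrates the general idea only and is not the paper's implementation: the fixed per-frame noise seed and the simple frame-to-frame blending stand in for the paper's sampling optimization and temporal consistency module, and the function name stylize_video and its blend parameter are hypothetical.

```python
# Illustrative sketch only: zero-shot per-frame text-guided stylization with an
# off-the-shelf latent diffusion img2img pipeline, plus a crude exponential
# moving-average blend between consecutive stylized frames as a stand-in for a
# real temporal-consistency module. Not the Style-A-Video implementation.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def stylize_video(frames, prompt, strength=0.5, guidance_scale=7.5,
                  blend=0.3, seed=0):
    """Stylize a list of PIL frames independently under one text prompt.

    strength       -- how much noise is added to each content frame
                      (lower preserves more structure)
    guidance_scale -- classifier-free guidance weight (higher follows the
                      style prompt more strongly)
    blend          -- weight of the previous stylized frame in a simple
                      EMA blend that damps inter-frame flicker
    """
    stylized, prev = [], None
    for frame in frames:
        # Re-seeding per frame keeps the sampled noise identical across
        # frames, which already reduces flicker in per-frame stylization.
        generator = torch.Generator("cuda").manual_seed(seed)
        out = pipe(prompt=prompt, image=frame, strength=strength,
                   guidance_scale=guidance_scale,
                   generator=generator).images[0]
        cur = np.asarray(out, dtype=np.float32)
        if prev is not None:
            cur = (1.0 - blend) * cur + blend * prev  # naive temporal smoothing
        prev = cur
        stylized.append(Image.fromarray(cur.astype(np.uint8)))
    return stylized
```

In the actual Style-A-Video pipeline, the guidance condition in the denoising process balances artistic expression against structure preservation, and a dedicated temporal consistency module replaces the naive blending shown here.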
Related papers
- I4VGen: Image as Free Stepping Stone for Text-to-Video Generation [28.910648256877113]
We present I4VGen, a novel video diffusion inference pipeline to enhance pre-trained text-to-video diffusion models.
I4VGen consists of two stages: anchor image synthesis and anchor image-augmented text-to-video synthesis.
Experiments show that the proposed method produces videos with higher visual realism and textual fidelity.
arXiv Detail & Related papers (2024-06-04T11:48:44Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing
with Diffusion Models [19.792535444735957]
RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training.
It produces high-quality videos while preserving original motion and semantic structure.
RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations.
arXiv Detail & Related papers (2023-12-07T18:43:45Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z) - WAIT: Feature Warping for Animation to Illustration video Translation
using GANs [12.681919619814419]
We introduce a new video stylization problem in which an unordered set of images is used.
Most of the video-to-video translation methods are built on an image-to-image translation model.
We propose a new generator network with feature warping layers which overcomes the limitations of the previous methods.
arXiv Detail & Related papers (2023-10-07T19:45:24Z) - In-Style: Bridging Text and Uncurated Videos with Style Transfer for
Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z) - VideoGen: A Reference-Guided Latent Diffusion Approach for High
Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z) - TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)