UniVST: A Unified Framework for Training-free Localized Video Style Transfer
- URL: http://arxiv.org/abs/2410.20084v3
- Date: Tue, 26 Nov 2024 09:16:50 GMT
- Title: UniVST: A Unified Framework for Training-free Localized Video Style Transfer
- Authors: Quanjian Song, Mingbao Lin, Wengyi Zhan, Shuicheng Yan, Liujuan Cao, Rongrong Ji
- Abstract summary: This paper presents UniVST, a unified framework for localized video style transfer based on a diffusion model.
It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos.
- Score: 102.52552893495475
- License:
- Abstract: This paper presents UniVST, a unified framework for localized video style transfer based on a diffusion model. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos. The endeavors of this paper comprise: (1) A point-matching mask propagation strategy that leverages the feature maps from DDIM inversion. This streamlines the model's architecture by obviating the need for tracking models. (2) A training-free AdaIN-guided video style transfer mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window consistent smoothing scheme that harnesses optical flow within the pixel representation and refines the predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in the stylized video. Our proposed UniVST has been validated to be superior to existing methods in both quantitative and qualitative metrics. It adeptly addresses the challenges of preserving the primary object's style while ensuring temporal consistency and detail preservation. Our code is available at https://github.com/QuanjianSong/UniVST.
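The AdaIN operation referenced in component (2) is the standard adaptive instance normalization, which aligns the per-channel statistics of a content feature to those of a style feature. Below is a minimal sketch of that operation applied to video-frame latents; the tensor shapes, function name, and the purely latent-level usage are illustrative assumptions and do not reproduce the authors' exact mechanism (which also operates at the attention level and is mask-guided).

```python
# Minimal AdaIN sketch, assuming Stable-Diffusion-like latents of shape (B, C, H, W).
# This illustrates the general operation only, not UniVST's full pipeline.
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the per-channel mean/std of `content` to those of `style`."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

if __name__ == "__main__":
    frame_latents = torch.randn(8, 4, 64, 64)  # 8 video-frame latents (assumed shape)
    style_latent = torch.randn(1, 4, 64, 64)   # latent of the style reference image
    stylized = adain(frame_latents, style_latent)  # style stats broadcast over frames
    print(stylized.shape)  # torch.Size([8, 4, 64, 64])
```

Because the style statistics are computed per channel and broadcast across the batch, a single style reference can restyle every frame latent in one call.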
Related papers
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework [33.46782517803435]
Make-Your-Anchor is a system requiring only a one-minute video clip of an individual for training.
We fine-tune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances.
A novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos.
arXiv Detail & Related papers (2024-03-25T07:54:18Z)
- Diffusion-based Human Motion Style Transfer with Semantic Guidance [23.600154466988073]
We propose a novel framework for few-shot style transfer learning based on the diffusion model.
In the first stage, we pre-train a diffusion-based text-to-motion model as a generative prior.
In the second stage, based on the single style example, we fine-tune the pre-trained diffusion model in a few-shot manner to make it capable of style transfer.
arXiv Detail & Related papers (2024-03-20T05:52:11Z)
- HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models [84.12784265734238]
The goal of Arbitrary Style Transfer (AST) is to inject the artistic features of a style reference into a given image/video.
We propose HiCAST, which is capable of explicitly customizing the stylization results according to various sources of semantic clues.
A novel learning objective is leveraged for video diffusion model training, which significantly improves cross-frame temporal consistency.
arXiv Detail & Related papers (2024-01-11T12:26:23Z)
- Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer [19.355744690301403]
We introduce a novel artistic style transfer method based on a pre-trained large-scale diffusion model without any optimization.
Our experimental results demonstrate that the proposed method surpasses both conventional and diffusion-based state-of-the-art style transfer baselines.
arXiv Detail & Related papers (2023-12-11T09:53:12Z)
- Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [13.098901971644656]
This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer together with an image latent diffusion model to achieve concise, text-controlled video stylization.
Tests show that it attains superior content preservation and stylistic performance while incurring lower computational consumption than previous solutions.
arXiv Detail & Related papers (2023-05-09T14:03:27Z)
- Two Birds, One Stone: A Unified Framework for Joint Learning of Image and Video Style Transfers [14.057935237805982]
Current arbitrary style transfer models are limited to either image or video domains.
We introduce UniST, a Unified Style Transfer framework for both images and videos.
We show that UniST performs favorably against state-of-the-art approaches on both tasks.
arXiv Detail & Related papers (2023-04-22T07:15:49Z)
- A Unified Arbitrary Style Transfer Framework via Adaptive Contrastive Learning [84.8813842101747]
Unified Contrastive Arbitrary Style Transfer (UCAST) is a novel style representation learning and transfer framework.
We present an adaptive contrastive learning scheme for style transfer by introducing an input-dependent temperature.
Our framework consists of three key components, i.e., a parallel contrastive learning scheme for style representation and style transfer, a domain enhancement module for effective learning of style distribution, and a generative network for style transfer.
arXiv Detail & Related papers (2023-03-09T04:35:00Z)
- StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)