PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
- URL: http://arxiv.org/abs/2510.07546v1
- Date: Wed, 08 Oct 2025 21:02:55 GMT
- Title: PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
- Authors: Soroush Mehraban, Vida Adeli, Jacob Rommann, Babak Taati, Kyryl Truskovskyi
- Abstract summary: PickStyle is a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images. CS-CFG ensures that context is preserved in the generated video while the style is effectively transferred.
- Score: 1.9039773121452204
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in the generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
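The abstract names three concrete mechanisms; each is sketched below under stated assumptions (none of these snippets is the authors' code). First, the low-rank adapters inserted into self-attention layers follow the standard LoRA pattern; a minimal PyTorch sketch, assuming the usual down/up projection with a zero-initialized up matrix:

```python
import torch

class LoRALinear(torch.nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)  # keep pretrained weights frozen
        self.down = torch.nn.Linear(base.in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)    # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Second, the synthetic training clips: the abstract says only that paired stills receive shared augmentations simulating camera motion, so the linear pan below is a hypothetical choice of trajectory (and assumes both images are larger than the crop):

```python
import numpy as np

def synthetic_clip(src: np.ndarray, sty: np.ndarray,
                   num_frames: int = 16, crop: int = 256,
                   seed: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Pan one crop window across a pixel-aligned (source, style) pair."""
    assert src.shape[:2] == sty.shape[:2], "pair must be pixel-aligned"
    h, w = src.shape[:2]
    rng = np.random.default_rng(seed)
    # One random linear pan, applied identically to both images so the
    # simulated camera motion is shared across the pair.
    y0, x0 = rng.integers(0, h - crop + 1), rng.integers(0, w - crop + 1)
    y1, x1 = rng.integers(0, h - crop + 1), rng.integers(0, w - crop + 1)
    src_clip, sty_clip = [], []
    for t in np.linspace(0.0, 1.0, num_frames):
        y, x = int((1 - t) * y0 + t * y1), int((1 - t) * x0 + t * x1)
        src_clip.append(src[y:y + crop, x:x + crop])
        sty_clip.append(sty[y:y + crop, x:x + crop])
    return np.stack(src_clip), np.stack(sty_clip)
```

Third, CS-CFG. The abstract specifies only that classifier-free guidance is factorized into independent text (style) and video (context) directions; one plausible two-direction form, with hypothetical guidance scales `s_ctx` and `s_sty`, is:

```python
import torch

def cs_cfg(eps_uncond: torch.Tensor,  # eps(x_t | no video, no text)
           eps_ctx: torch.Tensor,     # eps(x_t | video, no text)
           eps_full: torch.Tensor,    # eps(x_t | video, text)
           s_ctx: float = 2.0,        # hypothetical context scale
           s_sty: float = 7.5) -> torch.Tensor:  # hypothetical style scale
    """Guide along the video (context) and text (style) directions independently."""
    context_dir = eps_ctx - eps_uncond   # what the input video adds
    style_dir = eps_full - eps_ctx       # what the style prompt adds
    return eps_uncond + s_ctx * context_dir + s_sty * style_dir
```

Under this reading, raising `s_sty` pushes the output toward the prompted style while `s_ctx` anchors it to the input video; the paper's actual factorization and scales may differ.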
Related papers
- FreeViS: Training-free Video Stylization with Inconsistent References [57.411689597435334]
FreeViS is a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence.
Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works.
arXiv Detail & Related papers (2025-10-02T05:27:06Z)
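The FreeViS summary gives only the high-level recipe (several independently stylized reference frames conditioning a pretrained I2V model), so the sketch below illustrates that idea rather than FreeViS itself; `stylize_image` and `i2v_generate` are hypothetical stand-ins, not the paper's API.

```python
def stylize_video(frames, style_prompt, stylize_image, i2v_generate,
                  num_refs: int = 4):
    """Multi-reference stylization sketch: stylize anchor frames, then
    condition an image-to-video model on all of them at once."""
    # Stylize several evenly spaced anchor frames independently, so no
    # single stylized frame's artifacts propagate through the whole clip.
    step = (len(frames) - 1) / (num_refs - 1)
    idx = [round(i * step) for i in range(num_refs)]
    refs = [stylize_image(frames[i], style_prompt) for i in idx]
    # Condition the pretrained I2V model on all references together.
    return i2v_generate(references=refs, reference_indices=idx)
```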
- SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models [54.641809532055916]
We introduce SOYO, a novel diffusion-based framework for video style morphing.
Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency.
To harmonize style across video frames, we propose a novel adaptive sampling scheduler between two style images.
arXiv Detail & Related papers (2025-03-10T07:27:01Z)
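AdaIN, one of the two ingredients SOYO reuses, is a standard operation: re-normalize content features so their channel-wise statistics match those of the style features. A minimal PyTorch sketch of that operation alone, not SOYO's full pipeline:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Shift content features to the channel-wise statistics of the style.

    Both inputs are (batch, channels, height, width) feature maps.
    """
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True)
    return s_std * (content - c_mean) / c_std + s_mean
```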
- StyleMaster: Stylize Your Video with Artistic Generation and Translation [43.808656030545556]
Style control has been popular in video generation models.
Current methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style.
Our approach, StyleMaster, achieves significant improvement in both style resemblance and temporal coherence.
arXiv Detail & Related papers (2024-12-10T18:44:08Z)
- UniVST: A Unified Framework for Training-free Localized Video Style Transfer [102.52552893495475]
This paper presents UniVST, a unified framework for localized video style transfer based on diffusion models.
It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos.
arXiv Detail & Related papers (2024-10-26T05:28:02Z)
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter [78.75422651890776]
StyleCrafter is a generic method that enhances pre-trained T2V models with a style control adapter.
To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image.
StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images.
arXiv Detail & Related papers (2023-12-01T03:53:21Z)
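StyleCrafter's disentanglement step above is a data recipe rather than an architecture: style words leave the prompt, and style enters only as a reference-image embedding. A toy sketch of that split; `STYLE_WORDS` and `encode_image` are illustrative assumptions, not the paper's components:

```python
# Toy sketch: drop style descriptors from the text prompt and carry
# style only as an image embedding (assumed word list and encoder).
STYLE_WORDS = ("oil painting", "watercolor", "anime style", "pixel art")

def split_conditions(prompt: str, reference_image, encode_image):
    content_prompt = prompt
    for word in STYLE_WORDS:
        content_prompt = content_prompt.replace(word, "")
    content_prompt = " ".join(content_prompt.split())  # tidy whitespace
    style_embedding = encode_image(reference_image)    # style from image only
    return content_prompt, style_embedding
```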
- WAIT: Feature Warping for Animation to Illustration video Translation using GANs [11.968412857420192]
We introduce a new problem of video stylization in which an unordered set of images is used.
Most video-to-video translation methods are built on an image-to-image translation model.
We propose a new generator network with feature warping layers that overcomes the limitations of previous methods.
arXiv Detail & Related papers (2023-10-07T19:45:24Z)
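WAIT's key component is the feature warping layer; its internals are not described in the summary, so below is a generic flow-based feature warp of the kind video-translation generators commonly use, as a sketch only:

```python
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (B, C, H, W) feature map by a (B, 2, H, W) flow field."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat)  # (2, H, W), x then y
    new = grid.unsqueeze(0) + flow                        # shifted coordinates
    # Normalize to [-1, 1] as grid_sample expects.
    new_x = 2.0 * new[:, 0] / (w - 1) - 1.0
    new_y = 2.0 * new[:, 1] / (h - 1) - 1.0
    return F.grid_sample(feat, torch.stack((new_x, new_y), dim=-1),
                         align_corners=True)
```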
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [13.098901971644656]
This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer with an image latent diffusion model to achieve concise text-controlled video stylization.
Tests show that it attains superior content preservation and stylistic performance while consuming fewer resources than previous solutions.
arXiv Detail & Related papers (2023-05-09T14:03:27Z)
- Arbitrary Video Style Transfer via Multi-Channel Correlation [84.75377967652753]
We propose the Multi-Channel Correlation network (MCCNet) to fuse exemplar style features and input content features for efficient style transfer.
MCCNet works directly in the feature space of the style and content domains, where it learns to rearrange and fuse style features based on their similarity to content features.
The outputs generated by MCCNet are features containing the desired style patterns, which can further be decoded into images with vivid style textures.
arXiv Detail & Related papers (2020-09-17T01:30:46Z)
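For a sense of what "rearrange and fuse style features based on their similarity to content features" means in code, here is a generic attention-style fusion; MCCNet's actual multi-channel correlation layer differs in its normalization and channel handling, so treat this as a sketch of the general idea:

```python
import torch

def correlation_fusion(content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """Gather style features at each content position by similarity.

    content, style: (batch, channels, N) flattened feature maps.
    """
    # Similarity between every content position and every style position.
    attn = torch.softmax(
        torch.einsum("bcn,bcm->bnm", content, style) / content.shape[1] ** 0.5,
        dim=-1,
    )
    # Each content position receives a similarity-weighted mix of style
    # features, i.e. style patterns rearranged to match the content layout.
    return torch.einsum("bnm,bcm->bcn", attn, style)
```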
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.