StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style
Adapter
- URL: http://arxiv.org/abs/2312.00330v1
- Date: Fri, 1 Dec 2023 03:53:21 GMT
- Title: StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style
Adapter
- Authors: Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao
Wang, Yujiu Yang, Ying Shan
- Abstract summary: StyleCrafter is a generic method that enhances pre-trained T2V models with a style control adapter.
To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image.
StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images.
- Score: 74.68550659331405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video (T2V) models have shown remarkable capabilities in generating
diverse videos. However, they struggle to produce user-desired stylized videos
due to (i) text's inherent clumsiness in expressing specific styles and (ii)
the generally degraded style fidelity. To address these challenges, we
introduce StyleCrafter, a generic method that enhances pre-trained T2V models
with a style control adapter, enabling video generation in any style by
providing a reference image. Considering the scarcity of stylized video
datasets, we propose to first train a style control adapter using style-rich
image datasets, then transfer the learned stylization ability to video
generation through a tailor-made finetuning paradigm. To promote content-style
disentanglement, we remove style descriptions from the text prompt and extract
style information solely from the reference image using a decoupling learning
strategy. Additionally, we design a scale-adaptive fusion module to balance the
influences of text-based content features and image-based style features, which
helps generalization across various text and style combinations. StyleCrafter
efficiently generates high-quality stylized videos that align with the content
of the texts and resemble the style of the reference images. Experiments
demonstrate that our approach is more flexible and efficient than existing
competitors.
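
For illustration only, the following is a minimal PyTorch-style sketch of a scale-adaptive fusion step in the spirit of the abstract: features from a text-conditioned content branch and an image-conditioned style branch are combined with a weight predicted from the features themselves. The module and parameter names (ScaleAdaptiveFusion, scale_net, the feature shapes) are assumptions made for this sketch, not the paper's actual implementation.

# Hypothetical sketch of a scale-adaptive fusion module (not the authors' code).
# Idea: fuse text-conditioned content features and image-conditioned style
# features with a per-sample scale predicted from the features themselves.
import torch
import torch.nn as nn

class ScaleAdaptiveFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Small gating network that predicts a scalar weight for the style branch.
        self.scale_net = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.SiLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, content_feat: torch.Tensor, style_feat: torch.Tensor) -> torch.Tensor:
        # content_feat, style_feat: (batch, tokens, dim) outputs of two
        # cross-attention branches (text prompt vs. style reference image).
        pooled = torch.cat([content_feat.mean(dim=1), style_feat.mean(dim=1)], dim=-1)
        scale = torch.sigmoid(self.scale_net(pooled)).unsqueeze(1)  # (batch, 1, 1)
        # Content branch stays dominant; style branch is weighted adaptively.
        return content_feat + scale * style_feat

# Toy usage with random features.
fusion = ScaleAdaptiveFusion(dim=320)
content = torch.randn(2, 77, 320)
style = torch.randn(2, 77, 320)
fused = fusion(content, style)
print(fused.shape)  # torch.Size([2, 77, 320])

A sigmoid-gated scale keeps the content branch dominant while letting the style contribution adapt per sample, which is one simple way to realize the balancing behavior the abstract attributes to its fusion module.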
Related papers
- StyleMaster: Towards Flexible Stylized Image Generation with Diffusion Models [42.45078883553856]
Stylized Text-to-Image Generation (STIG) aims to generate images based on text prompts and style reference images.
In this paper, we propose a novel framework, dubbed StyleMaster, for this task by leveraging pretrained Stable Diffusion.
Two objective functions are introduced to optimize the model together with denoising loss, which can further enhance semantic and style consistency.
arXiv Detail & Related papers (2024-05-24T07:19:40Z)
- Rethink Arbitrary Style Transfer with Transformer and Contrastive Learning [11.900404048019594]
In this paper, we introduce an innovative technique to improve the quality of stylized images.
Firstly, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features.
In addition, we have developed an Instance-based Contrastive Learning (ICL) approach designed to understand relationships among various styles.
arXiv Detail & Related papers (2024-04-21T08:52:22Z)
- Style Aligned Image Generation via Shared Attention [61.121465570763085]
We introduce StyleAligned, a technique designed to establish style alignment among a series of generated images.
By employing minimal 'attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models.
Evaluation of our method across diverse styles and text prompts demonstrates high quality and fidelity.
arXiv Detail & Related papers (2023-12-04T18:55:35Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation [97.24936247688824]
This paper presents a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs.
StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references in a single pass.
arXiv Detail & Related papers (2023-09-04T19:16:46Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization [66.42741426640633]
DiffStyler is a dual diffusion processing architecture that controls the balance between the content and style of the diffused results.
We propose a content image-based learnable noise on which the reverse denoising process is based, enabling the stylization results to better preserve the structure information of the content image.
arXiv Detail & Related papers (2022-11-19T12:30:44Z)
- Arbitrary Style Guidance for Enhanced Diffusion-Based Text-to-Image Generation [13.894251782142584]
Diffusion-based text-to-image generation models like GLIDE and DALLE-2 have recently achieved wide success.
We propose a novel style guidance method to support generating images in an arbitrary style guided by a reference image.
arXiv Detail & Related papers (2022-11-14T20:52:57Z)