Two Birds, One Stone: A Unified Framework for Joint Learning of Image and Video Style Transfers
- URL: http://arxiv.org/abs/2304.11335v2
- Date: Sat, 2 Sep 2023 02:17:02 GMT
- Title: Two Birds, One Stone: A Unified Framework for Joint Learning of Image and Video Style Transfers
- Authors: Bohai Gu, Heng Fan, Libo Zhang
- Abstract summary: Current arbitrary style transfer models are limited to either image or video domains.
We introduce UniST, a Unified Style Transfer framework for both images and videos.
We show that UniST performs favorably against state-of-the-art approaches on both tasks.
- Score: 14.057935237805982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current arbitrary style transfer models are limited to either image or video
domains. In order to achieve satisfying image and video style transfers, two
different models are inevitably required with separate training processes on
image and video domains, respectively. In this paper, we show that this can be
avoided by introducing UniST, a Unified Style Transfer framework for both
images and videos. At the core of UniST is a domain interaction transformer
(DIT), which first explores context information within each domain and then
exchanges the contextualized domain information across domains for joint
learning. In particular, DIT lets image style transfer exploit temporal
information from videos and, in turn, brings the rich appearance texture of
images to video style transfer, leading to mutual benefits. Considering the
heavy computation of traditional multi-head self-attention, we present a simple
yet effective axial multi-head self-attention (AMSA) for DIT, which improves
computational efficiency while maintaining style transfer performance. To verify
the effectiveness of UniST, we conduct extensive experiments on both image and
video style transfer tasks and show that UniST performs favorably against
state-of-the-art approaches on both tasks. Code is available at
https://github.com/NevSNev/UniST.
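The abstract does not spell out how AMSA is implemented, but axial attention in general factorizes full spatial self-attention into a pass along the height axis and a pass along the width axis, reducing the cost from O((HW)^2) to roughly O(HW(H+W)). The sketch below is a minimal PyTorch illustration of that general idea; the class name, arguments, and layer choices are assumptions for illustration and are not taken from the UniST code.

```python
# Minimal sketch of axial multi-head self-attention, assuming a PyTorch
# interface. Names (AxialSelfAttention, dim, heads) are illustrative and
# not taken from the UniST repository.
import torch
import torch.nn as nn


class AxialSelfAttention(nn.Module):
    """Attends along the width axis, then the height axis, so the cost is
    O(H*W*(H+W)) instead of O((H*W)^2) for full self-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape

        # Attention along the width axis: each row is an independent sequence.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)

        # Attention along the height axis: each column is an independent sequence.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)


if __name__ == "__main__":
    attn = AxialSelfAttention(dim=64, heads=8)
    feats = torch.randn(2, 64, 32, 32)
    print(attn(feats).shape)  # torch.Size([2, 64, 32, 32])
```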
Related papers
- UniVST: A Unified Framework for Training-free Localized Video Style Transfer [66.69471376934034]
This paper presents UniVST, a unified framework for localized video style transfer.
It operates without the need for training, offering a distinct advantage over existing methods that transfer style across entire videos.
arXiv Detail & Related papers (2024-10-26T05:28:02Z)
- One-Shot Learning Meets Depth Diffusion in Multi-Object Videos [0.0]
This paper introduces a novel depth-conditioning approach that enables the generation of coherent and diverse videos from just a single text-video pair.
Our method fine-tunes the pre-trained model to capture continuous motion by employing custom-designed spatial and temporal attention mechanisms.
During inference, we use DDIM inversion to provide structural guidance for video generation.
arXiv Detail & Related papers (2024-08-29T16:58:10Z)
- WAIT: Feature Warping for Animation to Illustration video Translation using GANs [12.681919619814419]
We introduce a new problem for video stylization in which an unordered set of images is used.
Most video-to-video translation methods are built on an image-to-image translation model.
We propose a new generator network with feature warping layers that overcomes the limitations of previous methods.
arXiv Detail & Related papers (2023-10-07T19:45:24Z)
- A Unified Arbitrary Style Transfer Framework via Adaptive Contrastive Learning [84.8813842101747]
Unified Contrastive Arbitrary Style Transfer (UCAST) is a novel style representation learning and transfer framework.
We present an adaptive contrastive learning scheme for style transfer by introducing an input-dependent temperature (see the sketch after this list).
Our framework consists of three key components, i.e., a parallel contrastive learning scheme for style representation and style transfer, a domain enhancement module for effective learning of style distribution, and a generative network for style transfer.
arXiv Detail & Related papers (2023-03-09T04:35:00Z)
- ACE: Zero-Shot Image to Image Translation via Pretrained Auto-Contrastive-Encoder [2.1874189959020427]
We propose a new approach to extract image features by learning the similarities and differences of samples within the same data distribution.
The design of ACE enables us to achieve, for the first time, zero-shot image-to-image translation with no training on image translation tasks.
Our model achieves competitive results on multimodal image translation tasks with zero-shot learning as well.
arXiv Detail & Related papers (2023-02-22T23:52:23Z)
- Fine-Grained Image Style Transfer with Visual Transformers [59.85619519384446]
We propose a novel STyle TRansformer (STTR) network which breaks both content and style images into visual tokens to achieve a fine-grained style transformation.
To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk.
arXiv Detail & Related papers (2022-10-11T06:26:00Z)
- Domain Enhanced Arbitrary Image Style Transfer via Contrastive Learning [84.8813842101747]
Contrastive Arbitrary Style Transfer (CAST) is a new style representation learning and style transfer method via contrastive learning.
Our framework consists of three key components, i.e., a multi-layer style projector for style code encoding, a domain enhancement module for effective learning of style distribution, and a generative network for image style transfer.
arXiv Detail & Related papers (2022-05-19T13:11:24Z)
- StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)
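The CAST/UCAST entries above describe contrastive style representation learning with an input-dependent temperature; their exact formulation is given in those papers. The sketch below only illustrates the general shape of an InfoNCE-style loss whose temperature is predicted per sample. The temperature head, the value range, and all names are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of a contrastive (InfoNCE-style) style loss with an
# input-dependent temperature, in the spirit of the CAST/UCAST entries above.
# The per-sample temperature head and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTemperatureContrastiveLoss(nn.Module):
    def __init__(self, feat_dim: int, t_min: float = 0.05, t_max: float = 0.5):
        super().__init__()
        # Predicts one temperature per anchor from its style feature.
        self.temp_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.t_min, self.t_max = t_min, t_max

    def forward(self, anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
        # anchors, positives: (N, D) style embeddings; row i of `positives`
        # is the positive for row i of `anchors`, all other rows are negatives.
        a = F.normalize(anchors, dim=1)
        p = F.normalize(positives, dim=1)

        # Input-dependent temperature in [t_min, t_max], one value per anchor.
        tau = self.t_min + (self.t_max - self.t_min) * self.temp_head(anchors)  # (N, 1)

        logits = a @ p.t() / tau                      # (N, N), scaled per row
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss_fn = AdaptiveTemperatureContrastiveLoss(feat_dim=128)
    z_a, z_p = torch.randn(8, 128), torch.randn(8, 128)
    print(loss_fn(z_a, z_p).item())
```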
This list is automatically generated from the titles and abstracts of the papers on this site.