Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation
- URL: http://arxiv.org/abs/2407.03006v1
- Date: Wed, 3 Jul 2024 11:05:19 GMT
- Title: Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation
- Authors: Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu
- Abstract summary: Large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I).
This paper proposes the frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework.
- Score: 17.30877810859863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing open-domain image translation via user-provided text prompts. This paper proposes the frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on the Discrete Cosine Transform, which filters the latent features of the source image in the DCT domain, yielding filtered image features that carry different DCT spectral bands as different control signals to the pre-trained Latent Diffusion Model. We reveal that control signals of different DCT spectral bands relate the source image and the T2I-generated image through different correlations (e.g., style, structure, layout, contour), and thus enable versatile I2I applications emphasizing different I2I correlations, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Unlike related approaches, FCDiffusion establishes a unified text-guided I2I framework suitable for diverse image translation tasks simply by switching among different frequency control branches at inference time. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive qualitative and quantitative experiments. The code is publicly available at: https://github.com/XiangGao1102/FCDiffusion.
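The DCT-domain filtering step described in the abstract can be illustrated with a short sketch. This is a minimal illustration of the general idea, not the released FCDiffusion code: the latent shape, the (u+v) band mask, the band thresholds, and the function name `dct_band_filter` are assumptions made for clarity, and the actual framework applies such filtering inside trained frequency control branches of the Latent Diffusion Model.

```python
# Minimal sketch (assumptions, not the authors' implementation): filter a
# latent feature map in the DCT domain so that different spectral bands can
# serve as different control signals.
import numpy as np
from scipy.fft import dctn, idctn

def dct_band_filter(latent: np.ndarray, low: float, high: float) -> np.ndarray:
    """Keep only DCT coefficients whose normalized (u + v) frequency index
    falls inside [low, high); all other coefficients are zeroed out.

    latent: (C, H, W) latent feature map of the source image (illustrative shape).
    """
    C, H, W = latent.shape
    # Per-channel 2-D DCT (type-II, orthonormal).
    coeffs = dctn(latent, axes=(1, 2), norm="ortho")

    # Band mask over DCT indices: small (u + v) corresponds to low frequencies.
    u = np.arange(H)[:, None]
    v = np.arange(W)[None, :]
    freq = (u + v) / (H + W - 2)          # normalized to [0, 1]
    mask = (freq >= low) & (freq < high)  # spectral band to keep

    filtered = coeffs * mask[None, :, :]
    return idctn(filtered, axes=(1, 2), norm="ortho")

# Hypothetical band choices: a low-frequency band loosely preserves
# layout/style-level information, a higher band retains contour-level detail.
latent = np.random.randn(4, 64, 64).astype(np.float32)  # stand-in LDM latent
layout_control = dct_band_filter(latent, 0.0, 0.1)
contour_control = dct_band_filter(latent, 0.1, 0.5)
```

Switching which band is kept plays the role of switching among frequency control branches at inference time, as the abstract describes.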
Related papers
- DAFT-GAN: Dual Affine Transformation Generative Adversarial Network for Text-Guided Image Inpainting [2.656795553429629]
We propose a dual affine transformation generative adversarial network (DAFT-GAN) to maintain semantic consistency for text-guided inpainting.
Our proposed model outperforms the existing GAN-based models in both qualitative and quantitative assessments.
arXiv Detail & Related papers (2024-08-09T09:28:42Z)
- FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation [19.65838242227773]
This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner.
Our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band.
arXiv Detail & Related papers (2024-08-02T04:13:38Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a lightweight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion [23.142097481682306]
We introduce S2ST, a novel framework designed to accomplish global I2IT in complex images.
S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter.
We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes.
arXiv Detail & Related papers (2023-11-30T18:59:49Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
- Dual Diffusion Implicit Bridges for Image-to-Image Translation [104.59371476415566]
Common image-to-image translation methods rely on joint training over data from both source and target domains.
We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models.
DDIBs allow translations between arbitrary pairs of source-target domains, given independently trained diffusion models on respective domains.
arXiv Detail & Related papers (2022-03-16T04:10:45Z)
- Multi-domain Unsupervised Image-to-Image Translation with Appearance Adaptive Convolution [62.4972011636884]
We propose a novel multi-domain unsupervised image-to-image translation (MDUIT) framework.
We exploit the decomposed content feature and appearance adaptive convolution to translate an image into a target appearance.
We show that the proposed method produces visually diverse and plausible results in multiple domains compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-02-06T14:12:34Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)
- Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation [148.9985519929653]
We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation.
The proposed framework and modules are unified solutions and can be applied to solve other generation tasks such as semantic image synthesis.
arXiv Detail & Related papers (2020-02-03T23:17:10Z)