NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation
- URL: http://arxiv.org/abs/2505.12235v1
- Date: Sun, 18 May 2025 05:09:47 GMT
- Title: NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation
- Authors: Jia Li, Nan Gao, Huaibo Huang, Ran He,
- Abstract summary: diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation.<n>We propose a noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images.
- Score: 70.96827354717459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it's an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with around only 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.
Related papers
- LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection [11.700935740718675]
LATTE - Latent Trajectory Embedding - is a novel approach that models the evolution of latent embeddings across several denoising timesteps.<n>By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images.
arXiv Detail & Related papers (2025-07-03T12:53:47Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.<n>We propose Self-supervised Transfer (PST) and FrequencyDe-coupled Fusion module (FreDF)<n>PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.<n>FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation [19.65838242227773]
This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner.
Our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band.
arXiv Detail & Related papers (2024-08-02T04:13:38Z) - Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation [17.30877810859863]
Large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I)<n>This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework.
arXiv Detail & Related papers (2024-07-03T11:05:19Z) - Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS)
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z) - DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT)
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - Controlling Text-to-Image Diffusion by Orthogonal Finetuning [74.21549380288631]
We introduce a principled finetuning method -- Orthogonal Finetuning (OFT) for adapting text-to-image diffusion models to downstream tasks.
Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere.
We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
arXiv Detail & Related papers (2023-06-12T17:59:23Z) - Breaking Through the Haze: An Advanced Non-Homogeneous Dehazing Method
based on Fast Fourier Convolution and ConvNeXt [14.917290578644424]
Haze usually leads to deteriorated images with low contrast, color shift and structural distortion.
We propose a novel two branch network that leverages 2D discrete wavelete transform (DWT), fast Fourier convolution (FFC) residual block and a pretrained ConvNeXt model.
Our model is able to effectively explore global contextual information and produce images with better perceptual quality.
arXiv Detail & Related papers (2023-05-08T02:59:02Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.