Related papers: NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

URL: http://arxiv.org/abs/2505.12235v1
Date: Sun, 18 May 2025 05:09:47 GMT
Title: NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation
Authors: Jia Li, Nan Gao, Huaibo Huang, Ran He,
Abstract summary: diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation.<n>We propose a noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images.
Score: 70.96827354717459
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it's an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with around only 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.

Related papers

FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation [5.621818036193247]
We propose a novel framework adapting off-the-shelf T2I diffusion model into the I2I paradigm from a fresh frequency-domain perspective.<n>Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I.<n>We propose FBSDiff++ which improves upon FBSDiff mainly in three aspects: (1) accelerate inference speed by a large margin (8.9$times$ speedup in inference) with refined model architecture; (2) improve the Frequency Band Substitution module to allow for input source images of arbitrary resolution and aspect ratio; and (3) extend model functionality to enable localized image
arXiv Detail & Related papers (2026-01-27T02:39:20Z)
Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting.<n>Our single-step deterministic inference yields up to faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z)
T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models [89.29541056113442]
T2SMark is a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS)<n>We evaluate T2SMark on diffusion models with both U-Net and DiT backbones.
arXiv Detail & Related papers (2025-10-25T16:55:55Z)
LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection [11.700935740718675]
LATTE - Latent Trajectory Embedding - is a novel approach that models the evolution of latent embeddings across several denoising timesteps.<n>By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images.
arXiv Detail & Related papers (2025-07-03T12:53:47Z)
FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.<n>We propose Self-supervised Transfer (PST) and FrequencyDe-coupled Fusion module (FreDF)<n>PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.<n>FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation [19.65838242227773]
This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner. Our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band.
arXiv Detail & Related papers (2024-08-02T04:13:38Z)
Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation [17.30877810859863]
Large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I)<n>This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework.
arXiv Detail & Related papers (2024-07-03T11:05:19Z)
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS) A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks. We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT) DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
Controlling Text-to-Image Diffusion by Orthogonal Finetuning [74.21549380288631]
We introduce a principled finetuning method -- Orthogonal Finetuning (OFT) for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
arXiv Detail & Related papers (2023-06-12T17:59:23Z)
Breaking Through the Haze: An Advanced Non-Homogeneous Dehazing Method based on Fast Fourier Convolution and ConvNeXt [14.917290578644424]
Haze usually leads to deteriorated images with low contrast, color shift and structural distortion. We propose a novel two branch network that leverages 2D discrete wavelete transform (DWT), fast Fourier convolution (FFC) residual block and a pretrained ConvNeXt model. Our model is able to effectively explore global contextual information and produce images with better perceptual quality.
arXiv Detail & Related papers (2023-05-08T02:59:02Z)
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. We pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.