Related papers: EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

URL: http://arxiv.org/abs/2505.05209v3
Date: Sat, 05 Jul 2025 09:12:24 GMT
Title: EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution
Authors: Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu
Abstract summary: Enhancing Anything Model (EAM) is a novel Blind Super-Resolution method. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
Score: 11.331361804059625
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

Related papers

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing [21.859022356706838]
Transformer-based diffusion models have recently superseded traditional U-Net architectures. MM-DiT introduces a unified attention mechanism that performs a single full attention operation. We propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits.
arXiv Detail & Related papers (2025-08-11T00:40:12Z)
Policy Optimized Text-to-Image Pipeline Design [72.87655664038617]
We introduce a novel reinforcement learning-based framework for text-to-image generation. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations. We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z)
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals. We highlight the crucial importance of tailoring datasets to specific learning objectives. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z)
Effective Diffusion Transformer Architecture for Image Super-Resolution [63.254644431016345]
We design an effective diffusion transformer for image super-resolution (DiT-SR). In practice, DiT-SR leverages an overall U-shaped architecture and adopts a uniform isotropic design for all the transformer blocks. We analyze the limitation of the widely used AdaLN and present a frequency-adaptive time-step conditioning module.
arXiv Detail & Related papers (2024-09-29T07:14:16Z)
Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs). Existing binarization methods result in significant performance degradation. We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion [35.21106030549071]
Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks. We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
arXiv Detail & Related papers (2024-02-26T05:08:40Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique. Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.