Related papers: PromptFix: You Prompt and We Fix the Photo

URL: http://arxiv.org/abs/2405.16785v2
Date: Thu, 10 Oct 2024 16:09:22 GMT
Title: PromptFix: You Prompt and We Fix the Photo
Authors: Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, Jiebo Luo
Abstract summary: Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks. The lack of diverse instruction-following data hampers the development of models that recognize and execute user-customized instructions. We propose PromptFix, a framework that enables diffusion models to follow human instructions across a wide variety of image-processing tasks.
Score: 84.69812824355269
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development of models that effectively recognize and execute user-customized instructions, particularly in low-level tasks. Moreover, the stochastic nature of the diffusion process leads to deficiencies in image generation or editing tasks that require the detailed preservation of the generated images. To address these limitations, we propose PromptFix, a comprehensive framework that enables diffusion models to follow human instructions to perform a wide variety of image-processing tasks. First, we construct a large-scale instruction-following dataset that covers comprehensive image-processing tasks, including low-level tasks, image editing, and object creation. Next, we propose a high-frequency guidance sampling method to explicitly control the denoising process and preserve high-frequency details in unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. Experimental results show that PromptFix outperforms previous methods in various image-processing tasks. Our proposed model also achieves comparable inference efficiency with these baseline models and exhibits superior zero-shot capabilities in blind restoration and combination tasks. The dataset and code are available at https://www.yongshengyu.com/PromptFix-Page.

Related papers

Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling [54.54513714247062]
Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework. We found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. We propose Self-Adaptive Attention Scaling to dynamically scale the attention activation for each sub-instruction.
arXiv Detail & Related papers (2025-07-22T05:25:38Z)
Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition [33.22316608406554]
We propose a parameter-efficient dual adaptation method for both image and text encoders. Specifically, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. We develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions.
arXiv Detail & Related papers (2025-05-09T12:34:10Z)
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning [68.98988753763666]
We propose VisualCloze, a universal image generation framework. VisualCloze supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. We introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.
arXiv Detail & Related papers (2025-04-10T17:59:42Z)
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing [59.590505989071175]
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. We introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights.
arXiv Detail & Related papers (2025-03-16T21:11:25Z)
EditAR: Unified Conditional Generation with Autoregressive Models [58.093860528672735]
We propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods.
arXiv Detail & Related papers (2025-01-08T18:59:35Z)
Learned Single-Pass Multitasking Perceptual Graphics for Immersive Displays [11.15417027415116]
We propose a lightweight, text-guided, learned multitasking perceptual graphics model. Our model supports a variety of perceptual tasks, including foveated rendering, dynamic range enhancement, image denoising, and chromostereopsis. We evaluate our model's performance on embedded platforms and validate the perceptual quality of our model through a user study.
arXiv Detail & Related papers (2024-07-31T19:05:00Z)
Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis [3.783530340696776]
This study proposes a Multi-Agent framework to optimize input prompts for text-to-image generation models. A professional prompts database serves as a benchmark to guide the instruction modifier towards generating high-caliber prompts. Preliminary ablation studies highlight the effectiveness of various system components and suggest areas for future improvements.
arXiv Detail & Related papers (2024-06-13T00:33:29Z)
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks [38.6455393290578]
We propose DocRes, a model that unifies five document image restoration tasks including dewarping, deshadowing, appearance enhancement, deblurring, and binarization. To instruct DocRes to perform different restoration tasks, we propose a novel visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). DTSPrompt is more flexible than prior visual prompt approaches as it can be seamlessly applied and adapted to inputs with high and variable resolutions.
arXiv Detail & Related papers (2024-05-07T15:35:43Z)
Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
Exposure Bracketing is All You Need for Unifying Image Restoration and Enhancement Tasks [50.822601495422916]
We propose to utilize exposure bracketing photography to unify image restoration and enhancement tasks. Due to the difficulty in collecting real-world pairs, we suggest a solution that first pre-trains the model with synthetic paired data. In particular, a temporally modulated recurrent network (TMRNet) and self-supervised adaptation method are proposed.
arXiv Detail & Related papers (2024-01-01T14:14:35Z)
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices. Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. We present Painter, a generalist model which redefines the output of core vision tasks as images, and specifies task prompts as images as well.
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks. In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.