LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair
- URL: http://arxiv.org/abs/2411.19156v3
- Date: Mon, 09 Dec 2024 02:14:55 GMT
- Title: LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair
- Authors: Xue Song, Jiequan Cui, Hanwang Zhang, Jiaxin Shi, Jingjing Chen, Chi Zhang, Yu-Gang Jiang
- Abstract summary: We propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs. We learn an instruction-specific LoRA to encode the "change" in a before-after image pair, enhancing the interpretability and reusability of our model. Our model produces high-quality images that align with user intent, and it supports a broad spectrum of real-world visual instructions.
- Score: 116.48684498656871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs. Unlike natural language, which suffers from ambiguity, insufficient specificity, and diverse interpretations, visual instructions can accurately reflect users' intent. Building on the success of LoRA in text-based image editing and generation, we dynamically learn an instruction-specific LoRA to encode the "change" in a before-after image pair, enhancing the interpretability and reusability of our model. Furthermore, generalizable models for image editing with visual instructions typically require quad data, i.e., a before-after image pair along with query and target images. Due to the scarcity of such quad data, existing models are limited to a narrow range of visual instructions. To overcome this limitation, we introduce the LoRA Reverse optimization technique, enabling large-scale training with paired data alone. Extensive qualitative and quantitative experiments demonstrate that our model produces high-quality images that align with user intent, and that it supports a broad spectrum of real-world visual instructions.
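The abstract leaves the architecture unstated, but the core mechanism can be sketched in a few lines. Below is a minimal conceptual sketch, not the authors' implementation: a hypothetical hypernetwork maps features of the before/after pair to LoRA factors for one frozen linear layer, and all module names and dimensions are illustrative assumptions.

```python
# Conceptual sketch only -- not the authors' implementation. A hypothetical
# hypernetwork turns a before/after feature pair into LoRA factors that
# adapt one frozen linear layer; names and dimensions are illustrative.
import torch
import torch.nn as nn

class LoRAGenerator(nn.Module):
    """Encodes the 'change' between a before/after pair as LoRA factors."""
    def __init__(self, feat_dim=512, hidden=256, target_dim=512, rank=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())
        self.to_a = nn.Linear(hidden, target_dim * rank)  # down-projection factor
        self.to_b = nn.Linear(hidden, rank * target_dim)  # up-projection factor
        self.rank, self.dim = rank, target_dim

    def forward(self, before_feat, after_feat):
        h = self.encoder(torch.cat([before_feat, after_feat], dim=-1))
        a = self.to_a(h).view(self.dim, self.rank)
        b = self.to_b(h).view(self.rank, self.dim)
        return a, b                                       # weight delta = a @ b

gen = LoRAGenerator()
before, after = torch.randn(512), torch.randn(512)        # pair features
a, b = gen(before, after)

frozen = nn.Linear(512, 512).requires_grad_(False)        # frozen backbone layer
x = torch.randn(512)                                      # query-image feature
y = frozen(x) + x @ (a @ b)                               # base path + learned "change"
```

One plausible reading of LoRA Reverse, which the abstract does not spell out, is that the pair itself supplies supervision: the generated LoRA is trained to carry `before` toward `after`, and the reversed pair to undo the edit, so no separate query/target images are needed.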
Related papers
- DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition [69.10628479553709]
We introduce DRC, a novel personalized image generation framework that enhances Large Multimodal Models (LMMs) through disentangled representation composition.
DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively.
It involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation.
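A dual-tower disentangler of the kind this summary describes can be pictured concretely. The sketch below is a rough illustration under assumed sizes, with difficulty-aware importance sampling omitted; it is not the DRC code:

```python
# Rough illustration of a dual-tower disentangler: two encoders split one
# image feature into style and semantic parts, trained by reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTowerDisentangler(nn.Module):
    def __init__(self, dim=512, code=128):
        super().__init__()
        self.style_tower = nn.Sequential(nn.Linear(dim, code), nn.ReLU())
        self.semantic_tower = nn.Sequential(nn.Linear(dim, code), nn.ReLU())
        self.decoder = nn.Linear(2 * code, dim)           # reconstruction head

    def forward(self, feat):
        style = self.style_tower(feat)                    # user style preference
        semantic = self.semantic_tower(feat)              # semantic intention
        recon = self.decoder(torch.cat([style, semantic], dim=-1))
        return style, semantic, recon

model = DualTowerDisentangler()
feat = torch.randn(8, 512)                                # batch of image features
style, semantic, recon = model(feat)
loss = F.mse_loss(recon, feat)                            # reconstruction-driven objective
```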
arXiv Detail & Related papers (2025-04-24T08:10:10Z)
- Image-Editing Specialists: An RLAIF Approach for Diffusion Models [28.807572302899004]
We present a novel approach to training specialized instruction-based image-editing diffusion models.
We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences.
Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps.
arXiv Detail & Related papers (2025-04-17T10:46:39Z)
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data [39.17652541259225]
Composed Image Retrieval (CIR) is the task of retrieving images that match a reference image augmented with a modifying text.
We introduce an embedding reformulation architecture that effectively combines image and text modalities.
Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets.
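The summary does not detail InstructCIR's embedding reformulation, but the zero-shot CIR setting is easy to demonstrate with a common CLIP-based baseline: fuse the reference-image embedding with the modification text's embedding and rank a gallery by cosine similarity. The additive fusion rule and file names below are assumptions, not the paper's method:

```python
# Generic zero-shot CIR baseline using CLIP (not the InstructCIR model).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ref = Image.open("reference.jpg")                         # hypothetical files
gallery = [Image.open(f"gallery_{i}.jpg") for i in range(4)]

with torch.no_grad():
    img = model.get_image_features(**proc(images=ref, return_tensors="pt"))
    txt = model.get_text_features(**proc(text=["the same dress but in red"],
                                         return_tensors="pt", padding=True))
    query = F.normalize(img + txt, dim=-1)                # naive additive fusion
    feats = F.normalize(
        model.get_image_features(**proc(images=gallery, return_tensors="pt")),
        dim=-1)

print("best match:", (feats @ query.T).squeeze(-1).argmax().item())
```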
arXiv Detail & Related papers (2025-04-01T14:03:46Z)
- A LoRA is Worth a Thousand Pictures [28.928964530616593]
Low Rank Adaptation (LoRA) can replicate an artist's style or subject using minimal data and computation.
We show that LoRA weights alone can serve as an effective descriptor of style, without the need for additional image generation or knowledge of the original training set.
We conclude with a discussion on potential future applications, such as zero-shot LoRA fine-tuning and model attribution.
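The central claim, that LoRA weights alone act as a style descriptor, suggests a very small experiment: flatten each adapter's tensors and compare adapters directly. A minimal sketch, with hypothetical checkpoint files and assuming both adapters share the same keys and shapes:

```python
# Minimal sketch: compare two LoRA adapters by cosine similarity of their
# flattened weights. Checkpoint file names are hypothetical.
import torch
import torch.nn.functional as F

def lora_descriptor(state_dict):
    """Concatenate all LoRA tensors (in key order) into one flat vector."""
    return torch.cat([state_dict[k].flatten().float()
                      for k in sorted(state_dict)])

sd1 = torch.load("style_watercolor_lora.pt", map_location="cpu")
sd2 = torch.load("style_ukiyoe_lora.pt", map_location="cpu")
v1, v2 = lora_descriptor(sd1), lora_descriptor(sd2)
print("style similarity:", F.cosine_similarity(v1, v2, dim=0).item())
```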
arXiv Detail & Related papers (2024-12-16T18:18:17Z)
- SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing [42.23117201457898]
We introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing.
Our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
arXiv Detail & Related papers (2024-10-15T17:40:48Z)
- Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations [45.800383191637785]
We introduce LeafInstruct, the first open-source interleaved instruction-tuning dataset, with over 30,000 high-quality instances across more than 10 domains.
We propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization.
We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset.
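Read from the summary alone, the idea is two low-rank branches sharing one frozen layer, with tokens routed by modality. The sketch below is that reading under assumed sizes and a simple mask-based router; it is not the released Lateralization LoRA code:

```python
# Modality-specialized LoRA sketch: a frozen linear layer plus separate
# low-rank branches for text and image tokens, routed by a boolean mask.
import torch
import torch.nn as nn

class ModalitySpecializedLoRA(nn.Module):
    def __init__(self, dim=512, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim).requires_grad_(False)  # frozen backbone
        self.text_a = nn.Linear(dim, rank, bias=False)
        self.text_b = nn.Linear(rank, dim, bias=False)
        self.img_a = nn.Linear(dim, rank, bias=False)
        self.img_b = nn.Linear(rank, dim, bias=False)

    def forward(self, x, is_image):
        # x: (tokens, dim); is_image: (tokens,) bool routing mask
        delta_text = self.text_b(self.text_a(x))
        delta_img = self.img_b(self.img_a(x))
        delta = torch.where(is_image.unsqueeze(-1), delta_img, delta_text)
        return self.base(x) + delta

layer = ModalitySpecializedLoRA()
tokens = torch.randn(6, 512)                               # interleaved sequence
mask = torch.tensor([False, False, True, True, True, False])
out = layer(tokens, mask)                                  # (6, 512)
```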
arXiv Detail & Related papers (2024-07-04T03:28:22Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet users' specific requirements.
Significant recent advances in this field build on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
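Concretely, the self-constructed preference data this describes can be sketched as below, where `describe` stands in for any captioning call on the model and the corrupted-image/misleading-prompt split is an illustrative assumption:

```python
# Sketch of STIC-style preference-pair construction from unlabeled images.
import random

def build_preference_pair(image, describe, corrupt):
    """`describe(image, prompt)` is any captioning call; `corrupt` degrades
    the image (e.g., crop or blur). Chosen = careful description of the
    real image; rejected = description from a corrupted image or a
    deliberately misleading prompt."""
    chosen = describe(image, prompt="Describe the image in detail.")
    if random.random() < 0.5:
        rejected = describe(corrupt(image), prompt="Describe the image in detail.")
    else:
        rejected = describe(image, prompt="Invent details not visible in the image.")
    return {"image": image, "chosen": chosen, "rejected": rejected}
```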
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing [77.12834553200632]
We introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset.
The dataset is characterized by 1) reasoning instructions, 2) more realistic images from fine-grained categories, and 3) greater variation between input and edited images.
When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, independent of whether the tasks require reasoning or not.
arXiv Detail & Related papers (2024-05-18T06:03:42Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models in the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Reference-based Image Composition with Sketch via Structure-aware Diffusion Model [38.1193912666578]
We introduce a multi-input-conditioned image composition model that incorporates a sketch as a novel modality, alongside a reference image.
Thanks to the edge-level controllability using sketches, our method enables a user to edit or complete an image sub-part.
Our framework fine-tunes a pre-trained diffusion model to complete missing regions using the reference image while maintaining sketch guidance.
arXiv Detail & Related papers (2023-03-31T06:12:58Z)
- InstructPix2Pix: Learning to Follow Image Editing Instructions [103.77092910685764]
We propose a method for editing images from human instructions.
Given an input image and a written instruction that tells the model what to do, our model follows the instruction to edit the image.
We show compelling editing results for a diverse collection of input images and written instructions.
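InstructPix2Pix is also one of the few methods here with a widely packaged release: the authors' checkpoint runs through the Hugging Face diffusers pipeline (standard usage shown below; a CUDA GPU is assumed):

```python
# Standard diffusers usage of the released InstructPix2Pix checkpoint.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")             # your own image
edited = pipe(
    "make it look like a watercolor painting",             # written instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,                              # fidelity to the input
).images[0]
edited.save("edited.jpg")
```

`image_guidance_scale` trades fidelity to the input image against strength of the instruction; values around 1 to 2 are typical.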
arXiv Detail & Related papers (2022-11-17T18:58:43Z)
- Image Difference Captioning with Pre-training and Contrastive Learning [45.59621065755761]
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.
The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require learning a stronger vision-language association, and 2) the high cost of manual annotations.
We propose a new modeling framework following the pre-training-finetuning paradigm to address these challenges.
arXiv Detail & Related papers (2022-02-09T06:14:22Z)