$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
- URL: http://arxiv.org/abs/2504.13143v1
- Date: Thu, 17 Apr 2025 17:51:59 GMT
- Title: $\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
- Authors: Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie
- Abstract summary: We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to evaluate instruction-based image editing models. We harness GPT-4o to automatically collect a diverse set of editing instructions at scale. We introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline.
- Score: 36.58090024531738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured ``Chain-of-Edit'' pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a ``curse of synthetic data'': when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
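The ``Chain-of-Edit'' pipeline and Best-of-N selection described above lend themselves to a compact sketch. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the atomic edits, the composition rule, the editing model, and the VLM scorer are all placeholders standing in for the GPT-4o generation and auto-evaluation stages.

```python
# Sketch of the "Chain-of-Edit" idea: atomic edits are produced
# independently, then integrated into one complex instruction whose
# complexity is the number of chained atomic steps.
import random

ATOMIC_EDITS = [  # stand-ins for GPT-4o-generated atomic tasks
    "replace the sky with a sunset",
    "add a red bicycle leaning on the wall",
    "make the street look wet, as if after rain",
]

def compose_instruction(atomic_edits: list[str]) -> str:
    """Integrate atomic edits into one cohesive complex instruction."""
    return "; then ".join(atomic_edits)

def edit_image(image: str, instruction: str, seed: int) -> str:
    """Placeholder for any instruction-based editing model."""
    return f"{image}|{instruction}|seed={seed}"

def vlm_score(candidate: str) -> float:
    """Placeholder for the VLM-based auto-evaluation pipeline."""
    return random.random()

def best_of_n(image: str, instruction: str, n: int = 4) -> str:
    """Best-of-N selection: sample N edits, keep the highest-scoring one."""
    candidates = [edit_image(image, instruction, seed=i) for i in range(n)]
    return max(candidates, key=vlm_score)

complexity = 2  # instruction complexity = number of chained atomic edits
instruction = compose_instruction(ATOMIC_EDITS[:complexity])
print(best_of_n("input.png", instruction))
```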
Related papers
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [70.01883340129204]
Single-Pass Annotation with Reference-Guided Evaluation (SPARE) is a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime.
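A minimal sketch of the single-pass, per-step annotation idea, assuming a crude string-similarity aligner; SPARE's actual matching procedure and its explicit evaluation reasoning are not reproduced here.

```python
# Sketch: each generated solution step is aligned to one or more
# reference steps in a single pass; steps with no alignment are treated
# as incorrect. The similarity metric and threshold are illustrative.
from difflib import SequenceMatcher

def align_step(step: str, reference: list[str], threshold: float = 0.5):
    """Indices of reference steps this solution step plausibly matches."""
    sims = [SequenceMatcher(None, step, ref).ratio() for ref in reference]
    return [i for i, s in enumerate(sims) if s >= threshold]

def annotate_solution(solution: list[str], reference: list[str]):
    """One pass over the solution, labeling each step via its alignment."""
    annotations = []
    for step in solution:
        matched = align_step(step, reference)
        annotations.append({"step": step, "aligned_to": matched,
                            "correct": bool(matched)})  # reward-model label
    return annotations

reference = ["compute 2 + 3 = 5", "multiply 5 by 4 to get 20"]
solution = ["first, 2 + 3 = 5", "then 5 * 4 = 25"]  # second step is wrong
print(annotate_solution(solution, reference))
```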
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies [13.525744033075785]
Real-world scenarios often involve complex, multi-step instructions, particularly ``chain'' instructions where operations are interdependent. Current models struggle with these intricate directives, and existing benchmarks inadequately evaluate such capabilities. We introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks.
arXiv Detail & Related papers (2025-06-15T12:22:55Z) - CompBench: Benchmarking Complex Instruction-guided Image Editing [63.347846732450364]
CompBench is a large-scale benchmark for complex instruction-guided image editing. We propose an MLLM-human collaborative framework with tailored task pipelines. We also propose an instruction decoupling strategy that disentangles editing intents into four key dimensions.
arXiv Detail & Related papers (2025-05-18T02:30:52Z) - A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention [11.999319439383918]
This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data.
A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a multi-scale attention mechanism.
BiLSTM captures both forward and backward dependencies in sequences, enhancing the model's ability to perceive global contextual structures.
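A minimal sketch of such an architecture, assuming PyTorch and illustrative dimensions; reading ``multi-scale attention'' as attention over average-pooled views of the BiLSTM states at several window sizes is an assumption, not the paper's exact design.

```python
# Sketch: BiLSTM encoder + attention pooling at several temporal scales.
import torch
import torch.nn as nn

class BiLSTMMultiScaleAttention(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128, scales=(1, 3, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.bilstm = nn.LSTM(emb, hidden, batch_first=True,
                              bidirectional=True)
        # each scale smooths the sequence with a different pooling window
        self.pools = nn.ModuleList(
            [nn.AvgPool1d(k, stride=1, padding=k // 2) for k in scales])
        self.score = nn.Linear(2 * hidden, 1)      # shared attention scorer
        self.head = nn.Linear(2 * hidden * len(scales), 2)  # binary output

    def forward(self, tokens):                     # tokens: (B, T)
        h, _ = self.bilstm(self.embed(tokens))     # (B, T, 2H)
        views = []
        for pool in self.pools:
            smooth = pool(h.transpose(1, 2)).transpose(1, 2)[:, :h.size(1)]
            weights = torch.softmax(self.score(smooth), dim=1)  # (B, T, 1)
            views.append((weights * smooth).sum(dim=1))         # (B, 2H)
        return self.head(torch.cat(views, dim=-1))

model = BiLSTMMultiScaleAttention()
print(model(torch.randint(0, 1000, (2, 16))).shape)  # torch.Size([2, 2])
```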
arXiv Detail & Related papers (2025-04-21T16:53:02Z) - Incorporating Attributes and Multi-Scale Structures for Heterogeneous Graph Contrastive Learning [8.889313669713918]
We propose a novel contrastive learning framework for heterogeneous graphs (ASHGCL).
ASHGCL incorporates three distinct views focusing on node attributes, high-order structural information, and low-order structural information, respectively.
We introduce an attribute-enhanced positive sample selection strategy that combines both structural information and attribute information.
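A toy sketch of an attribute-enhanced positive selection rule of the kind the summary describes: candidates are ranked by a blend of structural closeness and attribute similarity. The two-hop structural score, cosine attribute score, and blend weight are illustrative assumptions.

```python
# Sketch: pick contrastive positives for an anchor node by combining
# low-/high-order structure (one- and two-hop reachability) with
# attribute (cosine) similarity.
import numpy as np

def select_positives(adj, feats, anchor, alpha=0.5, k=2):
    struct = adj + adj @ adj                 # one-hop + two-hop structure
    struct = struct[anchor] / (struct[anchor].max() + 1e-9)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    attr = f @ f[anchor]                     # cosine attribute similarity
    blended = alpha * struct + (1 - alpha) * attr
    blended[anchor] = -np.inf                # never pick the anchor itself
    return np.argsort(blended)[-k:]          # top-k positives

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)  # a 4-node path graph
feats = np.random.rand(4, 8)                 # toy node attributes
print(select_positives(adj, feats, anchor=0))
```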
arXiv Detail & Related papers (2025-03-18T05:15:21Z) - MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training [36.483136685734735]
We propose a Multi-granularity Self-Contrastive Training (MuSC) framework to improve complex instruction alignment without relying on a stronger model. Our method is evaluated on open-source models, and experimental results show that it achieves significant improvements on both complex and general instruction-following benchmarks.
arXiv Detail & Related papers (2025-02-17T08:12:49Z) - Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning [30.82220015525281]
Mosaic Instruction Tuning (Mosaic-IT) is a human/model-free compositional data synthesis method. Our evaluations demonstrate the superior performance and training efficiency of Mosaic-IT.
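A minimal sketch of one plausible reading of such human/model-free compositional synthesis: existing instruction-response pairs are concatenated into a single composite sample under a meta-instruction. The template wording is an assumption.

```python
# Sketch: fuse k existing instruction-response pairs into one composite
# training sample, with no human or model in the loop.
import random

def mosaic(samples: list[dict], k: int = 3, seed: int = 0) -> dict:
    picked = random.Random(seed).sample(samples, k)
    meta = f"Answer the following {k} tasks in order, numbering each answer."
    instruction = meta + "\n" + "\n".join(
        f"{i + 1}. {s['instruction']}" for i, s in enumerate(picked))
    response = "\n".join(
        f"{i + 1}. {s['response']}" for i, s in enumerate(picked))
    return {"instruction": instruction, "response": response}

pool = [{"instruction": f"task {i}", "response": f"answer {i}"}
        for i in range(10)]
print(mosaic(pool)["instruction"])
```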
arXiv Detail & Related papers (2024-05-22T04:08:20Z) - FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema [36.65009632307124]
We propose Free-form Instruction-oriented Prompt Optimization (FIPO) to improve the task performance of large language models (LLMs). FIPO uses a modular APO template that dynamically integrates the naive task instruction, optional instruction responses, and optional ground truth to produce finely optimized prompts. We validate the FIPO framework across five public benchmarks and six testing models.
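A minimal sketch of the modular template idea: the naive instruction is always present, while the model response and ground truth are slotted in only when available. The field markers and wording are assumptions, not FIPO's actual template.

```python
# Sketch: assemble an optimizer prompt from mandatory and optional modules.
from typing import Optional

def build_prompt(task_instruction: str,
                 response: Optional[str] = None,
                 ground_truth: Optional[str] = None) -> str:
    parts = ["Rewrite the task instruction below so a model solves it better.",
             f"[Instruction] {task_instruction}"]
    if response is not None:       # optional module: the model's attempt
        parts.append(f"[Model response] {response}")
    if ground_truth is not None:   # optional module: the reference answer
        parts.append(f"[Ground truth] {ground_truth}")
    parts.append("[Optimized instruction]")
    return "\n".join(parts)

print(build_prompt("Add 17 and 25.", response="41", ground_truth="42"))
```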
arXiv Detail & Related papers (2024-02-19T03:56:44Z) - Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z) - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z) - What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [111.01953096869947]
Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). We develop a systematic approach to automatically create high-quality complex visual reasoning instructions. Experimental results consistently demonstrate the enhanced performance of all compared MLLMs.
arXiv Detail & Related papers (2023-11-02T15:36:12Z) - Exploring Format Consistency for Instruction Tuning [79.0698403613366]
In this work, we propose a framework named Unified Instruction Tuning (UIT).
UIT calls OpenAI APIs for automatic format transfer among different instruction tuning datasets such as PromptSource, FLAN and CrossFit.
With the framework, we (1) demonstrate the necessity of maintaining format consistency in instruction tuning; (2) improve the generalization performance on unseen instructions on T5-LM-xl; and (3) provide a novel perplexity-based denoising method to reduce the noise of automatic format transfer.
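A minimal sketch of perplexity-based denoising: transferred instances that a language model finds unusually surprising are dropped as likely-broken transfers. The scorer below is a placeholder; in practice it would be exp(mean negative log-likelihood) under a real LM, and the keep ratio is an assumption.

```python
# Sketch: rank format-transferred instances by (placeholder) perplexity
# and keep only the most fluent fraction.

def perplexity(text: str) -> float:
    """Placeholder; in practice exp(mean NLL) under a language model."""
    leftover_markup = text.count("<") + text.count(">")
    return 10.0 + 50.0 * leftover_markup / max(len(text.split()), 1)

def denoise(transferred: list[str], keep_ratio: float = 0.8) -> list[str]:
    ranked = sorted(transferred, key=perplexity)
    return ranked[: max(1, int(keep_ratio * len(ranked)))]

batch = ["Summarize the article.",
         "Translate to French: hello",
         "<template> Summarize <input> article <output>"]  # broken transfer
print(denoise(batch))  # drops the high-perplexity, markup-laden instance
```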
arXiv Detail & Related papers (2023-07-28T12:00:13Z) - ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
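A toy sketch of the decompose-then-execute loop on an invented two-operation integer DSL; the subgoal predictor is a hand-written stand-in for ExeDec's learned model.

```python
# Sketch: predict the next execution subgoal, synthesize one DSL step
# that reaches it, execute, and repeat until the target output is hit.
OPS = {"inc": lambda x: x + 1, "double": lambda x: x * 2}

def predict_subgoal(state: int, target: int) -> int:
    """Stand-in for a learned subgoal predictor."""
    return state * 2 if state * 2 <= target else state + 1

def synthesize_step(state: int, subgoal: int):
    """Search the DSL for one op whose execution reaches the subgoal."""
    for name, fn in OPS.items():
        if fn(state) == subgoal:
            return name, fn(state)
    raise ValueError("no single op reaches the subgoal")

def exedec(state: int, target: int) -> list[str]:
    program = []
    while state != target:
        subgoal = predict_subgoal(state, target)
        op, state = synthesize_step(state, subgoal)  # execute as we go
        program.append(op)
    return program

print(exedec(1, 10))  # ['double', 'double', 'double', 'inc', 'inc']
```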
arXiv Detail & Related papers (2023-07-26T01:07:52Z) - Evaluating Modules in Graph Contrastive Learning [29.03038320344791]
We propose a framework that decomposes graph contrastive learning models into four modules.
We conduct experiments on node and graph classification tasks.
We release our implementations and results as OpenGCL, a modularized toolkit.
arXiv Detail & Related papers (2021-06-15T14:14:23Z)