Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
- URL: http://arxiv.org/abs/2505.00502v1
- Date: Thu, 01 May 2025 13:06:05 GMT
- Title: Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
- Authors: Suho Ryu, Kihyun Kim, Eugene Baek, Dongsoo Shin, Joonseok Lee
- Abstract summary: We introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). HATIE provides a fully-automated and omnidirectional evaluation pipeline. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects.
- Score: 9.899869794429579
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A variety of text-guided image editing models have been proposed recently. However, there is no widely accepted standard evaluation method, mainly due to the subjective nature of the task, leaving researchers to rely on manual user studies. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation that is not limited to specific easy-to-evaluate cases. HATIE also provides a fully-automated and omnidirectional evaluation pipeline. In particular, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and we report benchmark results on several state-of-the-art models to offer deeper insights into their performance.
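As a rough illustration of what combining multiple per-aspect scores into a single human-aligned score can look like, here is a minimal sketch; the aspect names, example values, and least-squares fitting are assumptions for illustration only, not HATIE's actual formulation.

```python
import numpy as np

# Hypothetical per-aspect scores for three edited images
# (columns: target fidelity, background preservation, image quality -- illustrative only).
aspect_scores = np.array([
    [0.82, 0.91, 0.75],
    [0.40, 0.88, 0.66],
    [0.95, 0.35, 0.80],
])
# Hypothetical averaged human ratings for the same three edits.
human_ratings = np.array([0.85, 0.55, 0.60])

# Fit weights so the weighted combination tracks the human ratings,
# then keep them non-negative and normalize them to sum to one.
weights, *_ = np.linalg.lstsq(aspect_scores, human_ratings, rcond=None)
weights = np.clip(weights, 0.0, None)
weights /= weights.sum()

def composite_score(scores: np.ndarray) -> np.ndarray:
    """Weighted combination of per-aspect scores."""
    return scores @ weights

print(composite_score(aspect_scores))
```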
Related papers
- Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance [8.216807467478281]
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric that accounts for both visual fidelity and text-prompt alignment. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models. (A brief recap of the underlying Fréchet distance follows this entry.)
arXiv Detail & Related papers (2025-03-27T17:35:14Z)
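For context, the (unconditional) Fréchet distance that FID-style metrics build on compares Gaussian approximations of the real and generated feature distributions, with means \mu_r, \mu_g and covariances \Sigma_r, \Sigma_g; cFreD's conditional variant, which additionally conditions on the text prompt, is defined in the paper itself.

\[
d_{\mathrm{F}}^2\big((\mu_r,\Sigma_r),(\mu_g,\Sigma_g)\big)
= \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)
\]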
- Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts [23.48816491333345]
Gamma, a Generic imAge assessMent model, can effectively assess images from diverse scenes through mixed-dataset training. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios.
arXiv Detail & Related papers (2025-03-09T16:07:58Z)
- Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias [52.590072198551944]
The aim of image personalization is to create images based on a user-provided subject. Current methods face challenges in ensuring fidelity to the text prompt. We introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images.
arXiv Detail & Related papers (2025-03-09T14:14:02Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI.
Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z)
- I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing [67.05794909694649]
We propose I2EBench, a comprehensive benchmark to evaluate the quality of edited images produced by IIE models.
I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions.
We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.
arXiv Detail & Related papers (2024-08-26T11:08:44Z)
- EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods [52.43439659492655]
We introduce EditVal, a standardized benchmark for quantitatively evaluating text-guided image editing methods.
EditVal consists of a curated dataset of images, a set of editable attributes for each image drawn from 13 possible edit types, and an automated evaluation pipeline.
We use EditVal to benchmark 8 cutting-edge diffusion-based editing methods including SINE, Imagic and Instruct-Pix2Pix.
arXiv Detail & Related papers (2023-10-03T20:46:10Z)
- HIVE: Harnessing Human Feedback for Instructional Visual Editing [127.29436858998064]
We present a novel framework to harness human feedback for instructional visual editing (HIVE).
Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences.
We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward (a generic reward-learning sketch follows this entry).
arXiv Detail & Related papers (2023-03-16T19:47:41Z)
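For context, a common way to learn a reward from pairwise human feedback is a Bradley-Terry style objective that scores a preferred edit above a rejected one. The sketch below is a generic illustration of that idea (the RewardModel network and feature dimensions are placeholders), not HIVE's specific formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Placeholder reward network mapping an image feature vector to a scalar score."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def preference_loss(reward_model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the reward of the human-preferred edit
    above the reward of the rejected one."""
    return -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()

# Example with random features standing in for edited-image embeddings.
rm = RewardModel()
loss = preference_loss(rm, torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
```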
- TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models [1.1252184947601962]
Evaluating and comparing text-to-image models is a challenging problem.
In this paper, a novel evaluation approach is tested, based on: (i) a curated data set divided into ten categories; (ii) a quantitative metric, the CLIP-score (a computation sketch follows this entry); and (iii) a human evaluation task to distinguish, for a given text, the real from the generated images.
Early experimental results show that the accuracy of human judgement is fully consistent with the CLIP-score.
arXiv Detail & Related papers (2022-12-15T13:52:03Z)
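A minimal sketch of a CLIP-score style image-text similarity, assuming the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint; the paper's exact scoring protocol may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

# Example (hypothetical file name):
# score = clip_score(Image.open("generated.png"), "a red bicycle leaning on a wall")
```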
- EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval, an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
We evaluate several pre-trained models; the results show that InstructGPT and PEER perform best, but most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)