Constantly Improving Image Models Need Constantly Improving Benchmarks
- URL: http://arxiv.org/abs/2510.15021v1
- Date: Thu, 16 Oct 2025 17:59:30 GMT
- Title: Constantly Improving Image Models Need Constantly Improving Benchmarks
- Authors: Jiaxin Ge, Grace Luo, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, David M. Chan,
- Abstract summary: We present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use.<n>Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from social media posts.
- Score: 109.39018167487103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.
Related papers
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning.<n>We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment.<n>We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering.<n>RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z) - OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts.<n>OneIG-Bench is a benchmark framework for evaluation of T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z) - Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models [0.0]
This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models.<n>Our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
arXiv Detail & Related papers (2025-05-06T18:53:34Z) - ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing [23.512687688393346]
ICE-Bench is a comprehensive benchmark designed to rigorously assess image generation models.<n>The evaluation framework assesses image generation capabilities across 6 dimensions.<n>We conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements.
arXiv Detail & Related papers (2025-03-18T17:53:29Z) - Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z) - KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
KITTEN is a benchmark for Knowledge-InTensive image generaTion on real-world ENtities.<n>We conduct a systematic study of the latest text-to-image models and retrieval-augmented models.<n>Analysis shows that even advanced text-to-image models fail to generate accurate visual details of entities.
arXiv Detail & Related papers (2024-10-15T17:50:37Z) - Fashion Image-to-Image Translation for Complementary Item Retrieval [13.88174783842901]
We introduce the Generative Compatibility Model (GeCo), a two-stage approach that improves fashion image retrieval through paired image-to-image translation.
Evaluations on three datasets show that GeCo outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-08-19T09:50:20Z) - IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - Improving Label Quality by Jointly Modeling Items and Annotators [68.8204255655161]
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators.
Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic David and Skene joint annotator-data model.
arXiv Detail & Related papers (2021-06-20T02:15:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.