RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
- URL: http://arxiv.org/abs/2504.17502v1
- Date: Thu, 24 Apr 2025 12:44:51 GMT
- Title: RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
- Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
- Abstract summary: RefVNLI is a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. It outperforms or matches existing baselines across multiple benchmarks and subject categories. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
- Score: 27.336251972097077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
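For illustration, below is a minimal, hypothetical sketch (not the authors' released code) of how a RefVNLI-style metric could be wrapped in an evaluation loop. It assumes a scoring function that takes a reference subject image, a prompt, and a generated image and returns separate textual-alignment and subject-preservation scores; all names, paths, and signatures here are assumptions.

```python
# Hypothetical sketch of a dual-score evaluation loop for subject-driven T2I.
# `score_fn` stands in for an actual RefVNLI-style model; its signature is assumed.
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class RefScores:
    textual_alignment: float      # does the generated image match the prompt?
    subject_preservation: float   # does it keep the referenced subject's identity?


def evaluate_generations(
    triplets: Iterable[Tuple[str, str, str]],            # (ref_image_path, prompt, gen_image_path)
    score_fn: Callable[[str, str, str], RefScores],
) -> RefScores:
    """Average both scores over a batch of (reference, prompt, generation) triplets."""
    scores = [score_fn(ref, prompt, gen) for ref, prompt, gen in triplets]
    n = max(len(scores), 1)
    return RefScores(
        textual_alignment=sum(s.textual_alignment for s in scores) / n,
        subject_preservation=sum(s.subject_preservation for s in scores) / n,
    )


if __name__ == "__main__":
    # Dummy scorer so the sketch runs end to end; replace with the real metric.
    dummy = lambda ref, prompt, gen: RefScores(0.9, 0.8)
    print(evaluate_generations([("cat_ref.jpg", "a photo of my cat surfing", "generated.png")], dummy))
```

Reporting the two scores separately, rather than collapsing them into one number, mirrors the paper's framing of the task as joint textual alignment and subject preservation.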
Related papers
- Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias [52.590072198551944]
The aim of image personalization is to create images based on a user-provided subject. Current methods face challenges in ensuring fidelity to the text prompt. We introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images.
arXiv Detail & Related papers (2025-03-09T14:14:02Z)
- Visual question answering based evaluation metrics for text-to-image generation [7.105786967332924]
This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object.
Experimental results show that our proposed evaluation approach is the superior metric that can simultaneously assess finer text-image alignment and image quality.
arXiv Detail & Related papers (2024-11-15T13:32:23Z)
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- Holistic Evaluation for Interleaved Text-and-Image Generation [19.041251355695973]
We introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation.
In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation.
arXiv Detail & Related papers (2024-06-20T18:07:19Z)
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce Stellar, a standardized dataset of personalized prompts paired with images of individuals; it is an order of magnitude larger than existing relevant datasets and includes rich semantic ground-truth annotations.
We also derive a simple yet efficient personalized text-to-image baseline that requires no test-time fine-tuning per subject and sets a new state of the art both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM).
We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z)
- Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
The approach can successfully assess the generation ability of text-to-image models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions show substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)