TISE: A Toolbox for Text-to-Image Synthesis Evaluation
- URL: http://arxiv.org/abs/2112.01398v1
- Date: Thu, 2 Dec 2021 16:39:35 GMT
- Title: TISE: A Toolbox for Text-to-Image Synthesis Evaluation
- Authors: Tan M. Dinh, Rang Nguyen, Binh-Son Hua
- Abstract summary: We conduct a study on state-of-the-art methods for single- and multi-object text-to-image synthesis.
We propose a common framework for evaluating these methods.
- Score: 9.092600296992925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we conduct a study on state-of-the-art methods for single- and
multi-object text-to-image synthesis and propose a common framework for
evaluating these methods. We first identify several common issues in the
current evaluation of text-to-image models, which are: (i) a commonly used
metric for image quality assessment, e.g., Inception Score (IS), is often
either miscalibrated for the single-object case or misused for the multi-object
case; (ii) the overfitting phenomenon appears in the existing R-precision (RP)
and SOA metrics, which are used to assess text relevance and object accuracy
aspects, respectively; (iii) many vital factors in the evaluation of the
multi-object case are largely ignored, e.g., object fidelity, positional
alignment, and counting alignment; (iv) the ranking of the methods based on current
metrics is highly inconsistent with real images. Then, to overcome these
limitations, we propose a combined set of existing and new metrics to
systematically evaluate the methods. For existing metrics, we develop an
improved version of IS named IS* by using temperature scaling to calibrate the
confidence of the classifier used by IS; we also propose a solution to mitigate
the overfitting issues of RP and SOA. To compensate for the vital evaluation
factors missing in the multi-object case, we introduce a set of new metrics:
CA for counting alignment, PA for positional alignment, and object-centric IS
(O-IS) and object-centric FID (O-FID) for object fidelity. Our
benchmark, therefore, results in a highly consistent ranking among existing
methods that is well aligned with human evaluation. We also create a strong
baseline model (AttnGAN++) for the benchmark via a simple modification of the
well-known AttnGAN. We will release this unified evaluation toolbox, called
TISE, to standardize the evaluation of text-to-image synthesis models.
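To make the IS* idea above concrete, here is a minimal sketch, assuming a standard NumPy setup: the classifier logits are rescaled by a temperature before the softmax, and the usual Inception Score is then computed on the calibrated probabilities. The function names, the single-split IS (no bootstrap splits), and the way the temperature is obtained are illustrative assumptions, not the TISE toolbox's actual implementation.

```python
import numpy as np

def temperature_scaled_softmax(logits, temperature):
    """Soften/sharpen classifier confidences by dividing logits by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def inception_score(probs, eps=1e-12):
    """Plain IS (single split): exp of the mean KL(p(y|x) || p(y)) over the image set."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

def is_star(logits, temperature):
    """IS* sketch: calibrate p(y|x) with temperature scaling, then compute IS."""
    return inception_score(temperature_scaled_softmax(logits, temperature))

# Hypothetical usage: `logits` are Inception-style classifier outputs on generated
# images; `temperature` would be fitted beforehand (e.g., by minimizing NLL on a
# held-out set of real images), following the standard temperature-scaling recipe.
logits = np.random.randn(5000, 1000)
print(is_star(logits, temperature=1.5))
```

In principle, the same calibrated probabilities could also feed the object-centric variant (O-IS) by running the classifier on detected object crops instead of whole images, but that part is not sketched here.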
Related papers
- Dynamic Correlation Learning and Regularization for Multi-Label Confidence Calibration [60.95748658638956]
This paper introduces the Multi-Label Confidence task, aiming to provide well-calibrated confidence scores in multi-label scenarios.
Existing single-label calibration methods fail to account for category correlations, which are crucial for addressing semantic confusion.
We propose the Dynamic Correlation Learning and Regularization algorithm, which leverages multi-grained semantic correlations to better model semantic confusion.
arXiv Detail & Related papers (2024-07-09T13:26:21Z)
- CrossScore: Towards Multi-View Image Evaluation and Scoring [24.853612457257697]
Our cross-reference image quality assessment method fills the gap in the image assessment landscape.
Our method enables accurate image quality assessment without requiring ground truth references.
arXiv Detail & Related papers (2024-04-22T17:59:36Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of texts generated by Large Language Models (LLMs).
We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score.
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z)
- Low-shot Object Learning with Mutual Exclusivity Bias [27.67152913041082]
This paper introduces Low-shot Object Learning with Mutual Exclusivity Bias (LSME), the first computational framing of mutual exclusivity bias.
We provide a novel dataset, comprehensive baselines, and a state-of-the-art method to enable the ML community to tackle this challenging learning task.
arXiv Detail & Related papers (2023-12-06T14:54:10Z)
- For A More Comprehensive Evaluation of 6DoF Object Pose Tracking [22.696375341994035]
We contribute a unified benchmark to address the above problems.
For more accurate annotation of YCBV, we propose a multi-view multi-object global pose refinement method.
In experiments, we validate the precision and reliability of the proposed global pose refinement method with a realistic semi-synthesized dataset.
arXiv Detail & Related papers (2023-09-14T15:35:08Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation [67.12857074801731]
We introduce a novel method, CPPF++, designed for sim-to-real pose estimation.
To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty.
We incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a feature ensemble.
arXiv Detail & Related papers (2022-11-24T03:27:00Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)