GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- URL: http://arxiv.org/abs/2401.04092v2
- Date: Tue, 9 Jan 2024 21:55:12 GMT
- Title: GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- Authors: Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas
Guibas, Dahua Lin, Gordon Wetzstein
- Abstract summary: This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models.
To this end, we first develop a prompt generator using GPT-4V to generate evaluating prompts.
We then design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria.
- Score: 93.55550787058012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in text-to-3D generative methods, there is a notable
absence of reliable evaluation metrics. Existing metrics usually focus on a
single criterion each, such as how well the asset aligns with the input text.
These metrics lack the flexibility to generalize to different evaluation
criteria and might not align well with human preferences. Conducting user
preference studies is an alternative that offers both adaptability and
human-aligned results. User studies, however, can be very expensive to scale.
This paper presents an automatic, versatile, and human-aligned evaluation
metric for text-to-3D generative models. To this end, we first develop a prompt
generator using GPT-4V to generate evaluating prompts, which serve as input to
compare text-to-3D models. We further design a method instructing GPT-4V to
compare two 3D assets according to user-defined criteria. Finally, we use these
pairwise comparison results to assign these models Elo ratings. Experimental
results suggest that our metric strongly aligns with human preferences across
different evaluation criteria.
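The abstract does not give implementation details, so the sketch below is only a rough illustration of the final step: converting GPT-4V's pairwise comparison outcomes into Elo ratings using the standard online Elo update. The function name, the (model_a, model_b, winner) tuple format, the K-factor of 32, and the base rating of 1000 are assumptions for illustration; the paper may fit Elo scores differently (for example, by maximum likelihood over all comparisons at once).

```python
# Minimal sketch (not the authors' code): online Elo updates over a list of
# pairwise comparison outcomes, where winner is "A", "B", or "tie".
from collections import defaultdict

def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Assign Elo ratings to models from pairwise comparison results."""
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in comparisons:
        r_a, r_b = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = r_a + k * (score_a - expected_a)
        ratings[model_b] = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical usage: outcomes of GPT-4V judgments on one evaluation criterion.
comparisons = [
    ("model_x", "model_y", "A"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "A"),
]
print(elo_ratings(comparisons))
```

The online update is order-dependent; a batch fit over all comparisons would remove that dependence, but either way higher ratings indicate models preferred more often under the chosen criterion.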
Related papers
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large
Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z) - What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained
Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Multidimensional Evaluation for Text Style Transfer Using ChatGPT [14.799109368073548]
We investigate the potential of ChatGPT as a multidimensional evaluator for the task of Text Style Transfer.
We test its performance on three commonly-used dimensions of text style transfer evaluation: style strength, content preservation, and fluency.
These preliminary results are expected to provide a first glimpse into the role of large language models in the multidimensional evaluation of stylized text generation.
arXiv Detail & Related papers (2023-04-26T11:33:35Z) - T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z) - On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)