GPTScore: Evaluate as You Desire
- URL: http://arxiv.org/abs/2302.04166v1
- Date: Wed, 8 Feb 2023 16:17:29 GMT
- Title: GPTScore: Evaluate as You Desire
- Authors: Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu
- Abstract summary: This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) from generative pre-trained models to score generated texts.
Experimental results on four text generation tasks, 22 evaluation aspects, and 37 corresponding datasets demonstrate that GPTScore lets users evaluate whatever aspects of a text they desire simply by providing natural language instructions.
- Score: 40.111346987131974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Artificial Intelligence (AI) has enabled the development of
sophisticated models that are capable of producing high-caliber text, images,
and other outputs through the utilization of large pre-trained models.
Nevertheless, assessing the quality of the generation is an even more arduous
task than the generation itself, and this issue has not been given adequate
consideration recently. This paper proposes a novel evaluation framework,
GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction)
from generative pre-trained models to score generated texts. Experimental
results on four text generation tasks, 22 evaluation aspects, and 37
corresponding datasets demonstrate that this approach effectively lets users
evaluate whatever aspects of a text they desire, simply by providing natural
language instructions. This property helps us overcome several long-standing
challenges in text evaluation: achieving customized, multi-faceted evaluation without the
need for annotated samples. We make our code publicly available at
https://github.com/jinlanfu/GPTScore.
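
As a concrete illustration (not the authors' released implementation), the sketch below shows one way an instruction-conditioned, zero-shot score of this kind can be computed: the hypothesis text is scored by its average token log-probability under a generative LM, conditioned on a natural-language instruction describing the evaluation aspect. The model choice (gpt2), prompt template, and aspect wording are illustrative assumptions; see the repository above for the actual implementation.

```python
# Minimal sketch of instruction-conditioned scoring (assumptions: gpt2 model,
# ad-hoc prompt template, "informativeness" aspect wording).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def gptscore_like(instruction: str, context: str, hypothesis: str) -> float:
    """Average log p(hypothesis | instruction + context) under the LM."""
    prefix = f"{instruction}\n{context}\n"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    hypo_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, hypo_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Log-probability of each hypothesis token given all preceding tokens.
    log_probs = torch.log_softmax(logits, dim=-1)
    start = prefix_ids.size(1)
    hypo_len = hypo_ids.size(1)
    token_log_probs = log_probs[0, start - 1 : start - 1 + hypo_len, :].gather(
        1, hypo_ids[0].unsqueeze(1)
    )
    return token_log_probs.mean().item()

# Example: score a summary on an assumed "informativeness" instruction.
instruction = ("Generate an informative summary for the following text that "
               "covers its key points.")
source = "The city council approved the new transit budget on Tuesday ..."
summary = "The council passed the transit budget."
print(gptscore_like(instruction, source, summary))
```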
Related papers
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates greater resolution than CLIPScore in differentiating popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- MOCHA: A Multi-Task Training Approach for Coherent Text Generation from Cognitive Perspective [22.69509556890676]
We propose a novel multi-task training strategy for coherent text generation grounded on the cognitive theory of writing.
We extensively evaluate our model on three open-ended generation tasks including story generation, news article writing and argument generation.
arXiv Detail & Related papers (2022-10-26T11:55:41Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets composed of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG, while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)