RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
- URL: http://arxiv.org/abs/2410.05193v1
- Date: Mon, 7 Oct 2024 16:50:47 GMT
- Title: RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
- Authors: Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
- Abstract summary: RevisEval is a novel text generation evaluation paradigm based on response-adapted references.
RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated.
- Score: 95.29800580588592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, there still remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of reference pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via the response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references, and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
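The two-step pipeline described in the abstract (revise the response into a response-adapted reference, then score the response against it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `revise_response` is a stubbed stand-in for the LLM reviser, and `unigram_f1` is a toy stand-in for a reference-based metric such as BLEU or BERTScore.

```python
def revise_response(instruction: str, response: str) -> str:
    """Stand-in for the LLM revision step. A real implementation would
    prompt an LLM to minimally edit the response so that it correctly
    satisfies the instruction."""
    return response.replace("Pairs", "Paris")

def unigram_f1(candidate: str, reference: str) -> float:
    """Toy reference-based metric: F1 over unigram overlap."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

instruction = "Name the capital of France."
response = "The capital of France is Pairs."

# Step 1: adapt the reference to the response under evaluation.
adapted_reference = revise_response(instruction, response)
# Step 2: score the response against the response-adapted reference.
score = unigram_f1(response, adapted_reference)
```

Because the reference is derived from the response itself, it stays maximally relevant to the text being judged, and the metric's score directly reflects how far the response is from its corrected form.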
Related papers
- Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models [7.529095331830944]
In current benchmarks for evaluating large language models (LLMs), there are issues such as evaluation content restriction, untimely updates, and lack of optimization guidance.
We propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment.
arXiv Detail & Related papers (2024-07-10T10:42:02Z)
- Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation [51.8188846284153]
RAG has been widely adopted to enhance Large Language Models (LLMs)
Attributed Text Generation (ATG) has attracted growing attention, which provides citations to support the model's responses in RAG.
This paper proposes a fine-grained ATG method called ReClaim(Refer & Claim), which alternates the generation of references and answers step by step.
arXiv Detail & Related papers (2024-07-01T20:47:47Z)
- From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications [26.857056013032263]
Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications.
Our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications.
arXiv Detail & Related papers (2024-04-10T15:46:08Z)
- MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation [22.19073789961769]
Advances in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues.
We propose MATEval, a "Multi-Agent Text Evaluation" framework.
Our framework incorporates self-reflection and Chain-of-Thought strategies, along with feedback mechanisms, to enhance the depth and breadth of the evaluation process.
arXiv Detail & Related papers (2024-03-28T10:41:47Z)
- CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples whose valid responses differ in semantics.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely [29.4981129248937]
We propose that some reference-based metrics can be effectively repurposed to assess a system summary against its corresponding source document.
After being repurposed reference-freely, the zero-shot BERTScore consistently outperforms its original reference-based version.
It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
arXiv Detail & Related papers (2022-12-20T06:01:13Z)
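The repurposing idea in the DocAsRef entry, scoring the summary against the source document instead of a gold reference, can be sketched as follows. All names are illustrative assumptions; a toy token-recall metric stands in for the zero-shot BERTScore the paper actually repurposes.

```python
def token_recall(summary: str, document: str) -> float:
    """Toy stand-in for a repurposed reference-based metric: the
    fraction of summary tokens that also appear in the source document."""
    doc_tokens = set(document.lower().split())
    summ_tokens = summary.lower().split()
    return sum(t in doc_tokens for t in summ_tokens) / len(summ_tokens)

document = "large language models can revise a response into a reference"
faithful = "models revise a response into a reference"
unfaithful = "cats enjoy sunny afternoons"

# The document itself serves as the "reference", so no gold reference
# is needed: a faithful summary should outscore an unfaithful one.
faithful_score = token_recall(faithful, document)
unfaithful_score = token_recall(unfaithful, document)
```

The same substitution works for any reference-based metric that takes a (candidate, reference) pair, which is what makes the repurposing reference-free.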
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.