BatchEval: Towards Human-like Text Evaluation
- URL: http://arxiv.org/abs/2401.00437v1
- Date: Sun, 31 Dec 2023 09:34:51 GMT
- Title: BatchEval: Towards Human-like Text Evaluation
- Authors: Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda
Wang, Kan Li
- Abstract summary: BatchEval is a paradigm that conducts batch-wise evaluation iteratively to alleviate the shortcomings of sample-wise evaluation.
We show that BatchEval outperforms state-of-the-art methods by 10.5% on Pearson correlations with only 64% API cost on average.
- Score: 12.187982795098623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in automatic text evaluation with the
introduction of large language models (LLMs) as evaluators. However, the current
sample-wise evaluation paradigm suffers from the following issues: (1) sensitivity
to prompt design; (2) poor resistance to noise; (3) inferior ensemble performance
with a static reference. Inspired by the fact that humans treat both the criterion
definition and inter-sample comparison as references during evaluation, we propose
BatchEval, a paradigm that conducts batch-wise evaluation iteratively to alleviate
the above problems. We explore variants under this paradigm and confirm that the
optimal setting is a two-stage procedure with a heterogeneous batch composition
strategy and a decimal scoring format.
Comprehensive experiments across 3 LLMs on 4 text evaluation tasks demonstrate
that BatchEval outperforms state-of-the-art methods by 10.5% on Pearson
correlations with only 64% API cost on average. Further analyses have been
conducted to verify the robustness, generalization, and working mechanism of
BatchEval.
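As a rough illustration of the paradigm described above, the sketch below scores candidate texts batch-wise over several rounds, recomposing heterogeneous batches from the previous round's scores and averaging the decimal scores per sample. The function and parameter names (heterogeneous_batches, score_batch, batch_size, rounds) and the placeholder scorer are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of batch-wise iterative evaluation in the spirit of BatchEval.
# All names, the batch size, the number of rounds, and the placeholder scorer
# are illustrative assumptions, not the paper's actual implementation.
import random
from statistics import mean
from typing import Callable, Dict, List


def heterogeneous_batches(samples: List[str],
                          prev_scores: Dict[str, float],
                          batch_size: int) -> List[List[str]]:
    """Compose batches so each one mixes samples with diverse previous scores."""
    ranked = sorted(samples, key=lambda s: prev_scores.get(s, 0.0))
    n_batches = max(1, len(ranked) // batch_size)
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    # Deal samples round-robin so every batch spans the quality range.
    for i, s in enumerate(ranked):
        batches[i % n_batches].append(s)
    return batches


def batch_eval(samples: List[str],
               score_batch: Callable[[List[str]], Dict[str, float]],
               batch_size: int = 4,
               rounds: int = 3) -> Dict[str, float]:
    """Iteratively score samples batch-wise and average the decimal scores."""
    history: Dict[str, List[float]] = {s: [] for s in samples}
    prev: Dict[str, float] = {}
    for _ in range(rounds):
        for batch in heterogeneous_batches(samples, prev, batch_size):
            # A two-stage prompt (analysis, then decimal scoring) would live
            # inside score_batch, which wraps one LLM call per batch.
            for s, score in score_batch(batch).items():
                history[s].append(score)
        prev = {s: mean(v) for s, v in history.items() if v}
    return prev


if __name__ == "__main__":
    # Placeholder standing in for an LLM call; returns random decimal scores.
    def fake_scorer(batch: List[str]) -> Dict[str, float]:
        return {s: round(random.uniform(1.0, 10.0), 1) for s in batch}

    texts = [f"candidate response {i}" for i in range(8)]
    print(batch_eval(texts, fake_scorer))
```

The design point this sketch tries to capture is that a batch-wise prompt lets the evaluator compare samples against each other rather than only against a fixed rubric, which is the intuition the abstract highlights.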
Related papers
- HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation [25.193026443079987]
HypoEval is a hypothesis-guided evaluation framework for large language models (LLMs).
With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation).
We conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
arXiv Detail & Related papers (2025-04-09T18:00:01Z)
- SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics.
SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric.
We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z)
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
- CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists [12.542045913426639]
CheckEval is a checklist-based evaluation framework that improves rating reliability via binary questions.
CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization.
We propose the Unified Multi-scenario Summarization Evaluation Model (UMSE).
UMSE is the first unified summarization evaluation framework that can be used across three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.