LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
- URL: http://arxiv.org/abs/2501.00274v1
- Date: Tue, 31 Dec 2024 04:57:01 GMT
- Title: LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
- Authors: Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie,
- Abstract summary: This paper introduces a framework for the automated evaluation of natural language texts.
A large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses.
LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another.
LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters.
- Score: 44.42036618863284
- License:
- Abstract: This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.
Related papers
- Potential and Perils of Large Language Models as Judges of Unstructured Textual Data [0.631976908971572]
This research investigates the effectiveness of LLM-as-judge models to evaluate the thematic alignment of summaries generated by other LLMs.
Our findings reveal that while LLM-as-judge offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances.
arXiv Detail & Related papers (2025-01-14T14:49:14Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
textbfJudger-1 is the first open-source textbfall-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
textbfJudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.
We introduce a universal and training-free framework, $textbfMQM-APE, based on the idea of filtering out non-impactful errors.
Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
arXiv Detail & Related papers (2024-09-22T06:43:40Z) - Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements.
We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z) - HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants [16.932009464531742]
We propose a novel automatic evaluation task: HumanRankEval.
It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans.
We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
arXiv Detail & Related papers (2024-05-15T08:47:26Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.