ALLURE: Auditing and Improving LLM-based Evaluation of Text using
Iterative In-Context-Learning
- URL: http://arxiv.org/abs/2309.13701v2
- Date: Wed, 27 Sep 2023 00:26:08 GMT
- Title: ALLURE: Auditing and Improving LLM-based Evaluation of Text using
Iterative In-Context-Learning
- Authors: Hosein Hasanbeig and Hiteshi Sharma and Leo Betthauser and Felipe
Vieira Frujeri and Ida Momennejad
- Abstract summary: Large language models (LLMs) are used for evaluation of text generated by humans and AI alike.
Despite their utility, LLMs exhibit distinct failure modes, necessitating a thorough audit and improvement of their text evaluation capabilities.
Here we introduce ALLURE, a systematic approach to Auditing Large Language Models Understanding and Reasoning Errors.
- Score: 7.457517083017178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From grading papers to summarizing medical documents, large language
models (LLMs) are increasingly used to evaluate text generated by humans and AI
alike. However, despite their extensive utility, LLMs exhibit distinct failure
modes, necessitating a thorough audit and improvement of their text evaluation
capabilities. Here we introduce ALLURE, a systematic approach to Auditing Large
Language Models Understanding and Reasoning Errors. ALLURE involves comparing
LLM-generated evaluations with annotated data, and iteratively incorporating
instances of significant deviation into the evaluator, which leverages
in-context learning (ICL) to enhance and improve robust evaluation of text by
LLMs. Through this iterative process, we refine the performance of the
evaluator LLM, ultimately reducing reliance on human annotators in the
evaluation process. We anticipate ALLURE to serve diverse applications of LLMs
in various domains related to evaluation of textual data, such as medical
summarization, education, and productivity.
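The audit-and-refine loop described in the abstract can be pictured with a brief, hedged sketch: the evaluator LLM's scores are compared against human-annotated references, and the examples with the largest deviation are appended to the prompt as in-context demonstrations before the next audit round. The snippet below is a minimal illustration under assumed names (call_llm, build_prompt, the selection size k); it is not the authors' released implementation.

```python
# Minimal sketch of an ALLURE-style audit-and-refine loop (hypothetical;
# not the authors' code). Idea: compare the evaluator LLM's scores against
# human-annotated references, then fold the most-deviating examples back
# into the prompt as in-context demonstrations and re-run the audit.

def call_llm(prompt: str) -> float:
    """Placeholder for an LLM API call that returns a numeric quality score."""
    raise NotImplementedError

def build_prompt(text: str, demonstrations: list) -> str:
    """Assemble an evaluation prompt from the accumulated ICL demonstrations."""
    demo_block = "\n".join(
        f"Text: {d['text']}\nCorrect score: {d['human_score']}"
        for d in demonstrations
    )
    return f"{demo_block}\nText: {text}\nScore:"

def audit_and_refine(annotated, n_rounds=3, k=5):
    """annotated: list of {'text': str, 'human_score': float} items."""
    demonstrations = []  # in-context examples accumulated across rounds
    for _ in range(n_rounds):
        # Audit: score every annotated text with the current prompt and
        # record how far the LLM's score deviates from the human score.
        deviations = []
        for item in annotated:
            pred = call_llm(build_prompt(item["text"], demonstrations))
            deviations.append((abs(pred - item["human_score"]), item))
        # Refine: add the k most-deviating examples as new demonstrations.
        deviations.sort(key=lambda d: d[0], reverse=True)
        demonstrations.extend(
            item for _, item in deviations[:k] if item not in demonstrations
        )
    return demonstrations
```

In practice, the deviation metric, the demonstration-selection rule, and the stopping criterion would follow the protocol described in the paper.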
Related papers
- Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation [2.5398014196797605]
This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs).
The framework uses the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation.
arXiv Detail & Related papers (2024-09-11T15:40:07Z)
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization [17.38671584773247]
This research investigates prompt designs for evaluating generated texts with large language models (LLMs).
We found that the order of presenting reasons and scores significantly influences LLMs' scoring.
An additional optimization may enhance scoring alignment if sufficient data is available.
arXiv Detail & Related papers (2024-06-14T12:31:44Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
However, how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test an LLM evaluator's ability to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.