Post Turing: Mapping the landscape of LLM Evaluation
- URL: http://arxiv.org/abs/2311.02049v1
- Date: Fri, 3 Nov 2023 17:24:50 GMT
- Title: Post Turing: Mapping the landscape of LLM Evaluation
- Authors: Alexey Tikhonov, Ivan P. Yamshchikov
- Abstract summary: This paper traces the historical trajectory of Large Language Model (LLM) evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research.
We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models.
This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
- Score: 22.517544562890663
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the rapidly evolving landscape of Large Language Models (LLMs),
the introduction of well-defined and standardized evaluation methodologies remains
a crucial challenge. This paper traces the historical trajectory of LLM
evaluations, from the foundational questions posed by Alan Turing to the modern
era of AI research. We categorize the evolution of LLMs into distinct periods,
each characterized by its unique benchmarks and evaluation criteria. As LLMs
increasingly mimic human-like behaviors, traditional evaluation proxies, such
as the Turing test, have become less reliable. We emphasize the pressing need
for a unified evaluation system, given the broader societal implications of
these models. Through an analysis of common evaluation methodologies, we
advocate for a qualitative shift in assessment approaches, underscoring the
importance of standardization and objective criteria. This work serves as a
call for the AI community to collaboratively address the challenges of LLM
evaluation, ensuring their reliability, fairness, and societal benefit.
Related papers
- LLM-based relevance assessment still can't replace human relevance assessment [12.829823535454505]
Recent studies suggest that large language models (LLMs) used for relevance assessment in information retrieval provide evaluations comparable to human judgments.
Upadhyay et al. claim that LLM-based relevance assessments can fully replace traditional human relevance assessments in TREC-style evaluations.
This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion.
arXiv Detail & Related papers (2024-12-22T20:45:15Z)
- Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.
We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language.
However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally. A minimal illustrative sketch of pairwise rank aggregation appears after this list.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preference.
HD-Eval inherits the essence of the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
- Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence [5.147767778946168]
We critically assess 23 state-of-the-art Large Language Models (LLMs) benchmarks.
Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, diversity, and the overlooking of cultural and ideological norms.
arXiv Detail & Related papers (2024-02-15T11:08:10Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to human evaluators.
This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning).
We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z)
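As a rough illustration of the pairwise-preference idea referenced in the PAIRS entry above, the sketch below ranks candidate texts by aggregating wins from an LLM judge. This is not the authors' algorithm: PAIRS uses uncertainty-guided search to avoid exhaustive comparisons, and the `llm_prefers` callable here is a hypothetical stand-in for a real pairwise-comparison prompt.

```python
# Minimal sketch of pairwise-preference rank aggregation (not the PAIRS
# implementation). `llm_prefers(a, b)` is a hypothetical judge callable that
# returns True when the judge prefers text `a` over text `b`.
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_preference(
    candidates: List[str],
    llm_prefers: Callable[[str, str], bool],
) -> List[str]:
    """Order candidates best-first by their pairwise win counts."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        if llm_prefers(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


if __name__ == "__main__":
    # Toy judge (prefers the longer text) purely to keep the sketch runnable.
    toy_judge = lambda a, b: len(a) > len(b)
    print(rank_by_pairwise_preference(
        ["short", "a much longer candidate", "medium text"], toy_judge))
```

In practice the judge would be an LLM prompt comparing two candidate outputs, and the exhaustive loop would be replaced by a search strategy that orders comparisons by uncertainty, as PAIRS does.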