Related papers: A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

URL: http://arxiv.org/abs/2407.04069v1
Date: Thu, 4 Jul 2024 17:15:37 GMT
Title: A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Authors: Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang
Abstract summary: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities. We systematically review the primary challenges and limitations that cause inconsistent and unreliable LLM evaluations. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
Score: 35.12731651234186
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

Related papers

An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability [2.8948274245812327]
We study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.
arXiv Detail & Related papers (2025-06-16T16:04:43Z)
Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks [18.613353004764885]
This study reveals novel insights into the limitations of existing methods. We propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
arXiv Detail & Related papers (2025-05-28T15:17:34Z)
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations [29.031539043555362]
Large Language Models (LLMs) are increasingly used to evaluate information systems. Recent studies suggest that LLM-based evaluations often align with human judgments. This paper examines scenarios where LLM-evaluators may falsely indicate success.
arXiv Detail & Related papers (2025-04-27T02:14:21Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review [36.05498398665352]
We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process (papers, reviews, and rebuttals) across several quality aspects.
arXiv Detail & Related papers (2025-02-18T03:50:06Z)
Evaluating the Consistency of LLM Evaluators [9.53888551630878]
Large language models (LLMs) have shown potential as general evaluators. Their consistency as evaluators is still understudied, raising concerns about the reliability of LLM evaluators.
arXiv Detail & Related papers (2024-11-30T17:29:08Z)
A Survey on LLM-as-a-Judge [10.257160590560824]
Large Language Models (LLMs) have achieved remarkable success across diverse domains. LLMs present a compelling alternative to traditional expert-driven evaluations. This paper addresses the core question: How can reliable LLM-as-a-Judge systems be built?
arXiv Detail & Related papers (2024-11-23T16:03:35Z)
ReIFE: Re-evaluating Instruction-Following Evaluation [105.75525154888655]
We present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 proposed evaluation protocols. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness.
arXiv Detail & Related papers (2024-10-09T17:14:50Z)
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks [3.773596042872403]
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks. This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
arXiv Detail & Related papers (2024-07-29T03:37:14Z)
Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEval is a novel benchmark designed to comprehensively and reliably evaluate the critique ability of Large Language Models. To ensure comprehensiveness, CriticEval evaluates critique ability from four dimensions across nine diverse task scenarios. To ensure reliability, a large number of critiques are annotated to serve as references.
arXiv Detail & Related papers (2024-02-21T12:38:59Z)
F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
Unveiling Bias in Fairness Evaluations of Large Language Models: A Critical Literature Review of Music and Movie Recommendation Systems [0.0]
The rise of generative artificial intelligence, particularly Large Language Models (LLMs), has intensified the imperative to scrutinize fairness alongside accuracy. Recent studies have begun to investigate fairness evaluations for LLMs within domains such as recommendations. Yet, the degree to which current fairness evaluation frameworks account for personalization remains unclear.
arXiv Detail & Related papers (2024-01-08T17:57:29Z)
A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry. This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.