LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews
- URL: http://arxiv.org/abs/2511.12635v1
- Date: Sun, 16 Nov 2025 15:04:50 GMT
- Title: LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews
- Authors: Lech Madeyski, Barbara Kitchenham, Martin Shepperd
- Abstract summary: We identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in systematic reviews. Major weaknesses included a failure to use metrics that are robust to imbalanced data and that directly indicate whether results are better than chance. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers, practitioners, and policymakers are built.
- Score: 2.2175470459999636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: Large language models (LLMs) are released faster than users' ability to evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) a failure to use metrics that are robust to imbalanced data and that directly indicate whether results are better than chance (e.g., reliance on Accuracy), ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) a pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed), which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside the chance-anchored, cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analyses in which FNs carry higher penalties than FPs.
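The abstract's core argument is that Accuracy hides poor screening performance on imbalanced datasets, whereas recall (lost evidence) and chance-anchored metrics such as MCC, computed from the full confusion matrix, do not. The sketch below illustrates this point; it is a minimal example, not the paper's method. The standard MCC formula is used, the paper's Weighted MCC (WMCC) is not reproduced (its weighting scheme is defined in the paper), and the `fn_cost`/`fp_cost` parameters are a hypothetical illustration of the cost-benefit framing in which false negatives are penalized more than false positives.

```python
import math

def screening_metrics(tp, fp, tn, fn, fn_cost=10.0, fp_cost=1.0):
    """Compute screening metrics from a 2x2 confusion matrix.

    tp/fp/tn/fn: counts, where "positive" means a study judged relevant.
    fn_cost/fp_cost: hypothetical penalties illustrating that a missed
    relevant study (FN, lost evidence) is costlier than a wasted screen (FP);
    this is NOT the paper's WMCC weighting, which is defined in the paper itself.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total                      # misleading under class imbalance
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # 1 - recall ~ share of lost evidence
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # Standard Matthews Correlation Coefficient: 0 = chance-level, 1 = perfect.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0

    # Hypothetical cost-sensitive loss: FNs weighted more heavily than FPs.
    weighted_cost = fn_cost * fn + fp_cost * fp

    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "mcc": mcc, "weighted_cost": weighted_cost}

# Example: a highly imbalanced screening task (40 relevant vs. 1960 irrelevant studies).
# Accuracy is 0.945 even though a quarter of the relevant studies are lost (recall 0.75).
print(screening_metrics(tp=30, fp=100, tn=1860, fn=10))
```

Reporting the four raw counts (TP, FP, TN, FN), as the paper recommends, lets future meta-analysts recompute any such metric, including chance-anchored and cost-sensitive variants, without re-running the screening experiment.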
Related papers
- Measuring what Matters: Construct Validity in Large Language Model Benchmarks [103.53142193393931]
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. We conduct a systematic review of 445 benchmarks from leading conferences in natural language processing and machine learning. We find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims.
arXiv Detail & Related papers (2025-11-03T17:39:40Z) - Redefining Retrieval Evaluation in the Era of LLMs [20.75884808285362]
Traditional Information Retrieval (IR) metrics assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs). We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones.
arXiv Detail & Related papers (2025-10-24T13:17:00Z) - AllSummedUp: an open-source framework for comparing summarization evaluation metrics [2.2153783542347805]
This paper investigates challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics, we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics.
arXiv Detail & Related papers (2025-08-29T08:05:00Z) - Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It [1.6261897792391753]
We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi. We uncover pervasive flaws in both benchmark items and evaluation methodology.
arXiv Detail & Related papers (2025-06-30T13:57:28Z) - Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review [6.946630487078163]
Large Language Models (LLMs) have been transformative across many domains. Researchers have applied Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. This survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
arXiv Detail & Related papers (2025-04-25T13:34:40Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)