Can large language models replace humans in the systematic review
process? Evaluating GPT-4's efficacy in screening and extracting data from
peer-reviewed and grey literature in multiple languages
- URL: http://arxiv.org/abs/2310.17526v2
- Date: Fri, 27 Oct 2023 12:14:27 GMT
- Title: Can large language models replace humans in the systematic review
process? Evaluating GPT-4's efficacy in screening and extracting data from
peer-reviewed and grey literature in multiple languages
- Authors: Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch,
Kristin Hadfield
- Abstract summary: This study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction using a 'human-out-of-the-loop' approach.
GPT-4 had accuracy on par with human performance in most tasks, but results were skewed by chance agreement and dataset imbalance.
When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect'.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Systematic reviews are vital for guiding practice, research, and policy, yet
they are often slow and labour-intensive. Large language models (LLMs) could
offer a way to speed up and automate systematic reviews, but their performance
in such tasks has not been comprehensively evaluated against humans, and no
study has tested GPT-4, the biggest LLM so far. This pre-registered study
evaluates GPT-4's capability in title/abstract screening, full-text review, and
data extraction across various literature types and languages using a
'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human
performance in most tasks, results were skewed by chance agreement and dataset
imbalance. After adjusting for these, there was a moderate level of performance
for data extraction, and - barring studies that used highly reliable prompts -
screening performance levelled at none to moderate for different stages and
languages. When screening full-text literature using highly reliable prompts,
GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key
studies using highly reliable prompts improved its performance even more. Our
findings indicate that, currently, substantial caution should be used if LLMs
are being used to conduct systematic reviews, but suggest that, for certain
systematic review tasks delivered under reliable prompts, LLMs can rival human
performance.
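As a concrete illustration of the screening task evaluated here, the sketch below classifies one title/abstract record with GPT-4 and then measures chance-corrected agreement against human decisions using Cohen's kappa. This is a minimal sketch under stated assumptions, not the authors' pre-registered pipeline: the prompt wording, placeholder criteria, model name, and label format are all illustrative, and the client call follows the current `openai` Python package rather than anything specified in the paper.

```python
# Minimal sketch of GPT-4 title/abstract screening with chance-corrected agreement.
# The prompt text, criteria placeholders, and label format are illustrative only;
# they are NOT the pre-registered prompts used in the study.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREENING_PROMPT = (
    "You are screening studies for a systematic review on <topic>.\n"
    "Inclusion criteria: <criteria>.\n"
    "Answer with exactly one word: INCLUDE or EXCLUDE.\n\n"
    "Title: {title}\nAbstract: {abstract}"
)

def screen_record(title: str, abstract: str) -> str:
    """Ask the model for a binary include/exclude decision on one record."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep screening decisions as deterministic as possible
        messages=[{
            "role": "user",
            "content": SCREENING_PROMPT.format(title=title, abstract=abstract),
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "INCLUDE" if answer.startswith("INCLUDE") else "EXCLUDE"

def chance_corrected_agreement(model_labels, human_labels) -> float:
    """Cohen's kappa: raw accuracy can look high on imbalanced screening data,
    so agreement is corrected for what two raters would agree on by chance."""
    return cohen_kappa_score(model_labels, human_labels)
```

A chance-corrected measure matters here because, as the abstract notes, raw accuracy was inflated by chance agreement and dataset imbalance.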
Related papers
- An Empirical Study on Information Extraction using Large Language Models [36.090082785047855]
Human-like large language models (LLMs) have proven to be very helpful for many natural language processing (NLP) related tasks.
We propose and analyze the effects of a series of simple prompt-based methods on GPT-4's information extraction ability.
arXiv Detail & Related papers (2024-08-31T07:10:16Z) - Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews [7.355182982314533]
Large Language Models (LLMs) can be used to enhance the efficiency, speed, and precision of literature review filtering.
We show that using advanced LLMs with simple prompting can significantly reduce the time required for literature filtering.
We also show that false negatives can indeed be controlled through a consensus scheme, achieving recalls >98.8%, at or even beyond the typical human error threshold (a generic consensus sketch appears after this list).
arXiv Detail & Related papers (2024-07-15T12:13:53Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z) - Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
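The 'Cutting Through the Clutter' entry above refers to controlling false negatives through a consensus scheme; the sketch below shows one generic way such a scheme can work. It is an assumption, not that paper's actual method: a record is excluded only when every independent screening pass votes to exclude it, which spends extra model calls to push recall up.

```python
# Generic consensus screening sketch (illustrative; not the method of any specific paper).
# A record is excluded only if ALL independent passes vote EXCLUDE; otherwise it is
# kept for human review, biasing the pipeline toward recall (fewer false negatives).
from collections import Counter
from typing import Callable

def consensus_screen(record: dict,
                     screen_once: Callable[[dict], str],
                     n_votes: int = 3) -> str:
    """Run n independent screening passes and keep the record unless every
    pass votes EXCLUDE. screen_once is any callable returning 'INCLUDE'/'EXCLUDE',
    e.g. a wrapper around an LLM screening prompt."""
    votes = Counter(screen_once(record) for _ in range(n_votes))
    return "EXCLUDE" if votes.get("EXCLUDE", 0) == n_votes else "INCLUDE"
```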