Large language models streamline automated systematic review: A preliminary study
- URL: http://arxiv.org/abs/2502.15702v1
- Date: Thu, 09 Jan 2025 01:59:35 GMT
- Title: Large language models streamline automated systematic review: A preliminary study
- Authors: Xi Chen, Xue Zhang
- Abstract summary: Large Language Models (LLMs) have shown promise in natural language processing tasks, with the potential to automate systematic reviews. This study evaluates the performance of three state-of-the-art LLMs in conducting systematic review tasks.
- Score: 12.976248955642037
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown promise in natural language processing tasks and have the potential to automate systematic reviews. This study evaluates the performance of three state-of-the-art LLMs in conducting systematic review tasks. We assessed GPT-4, Claude-3, and Mistral 8x7B across four systematic review tasks: study design formulation, search strategy development, literature screening, and data extraction. We provided a reference standard sourced from a previously published systematic review, comprising the standard PICO (Population, Intervention, Comparison, Outcome) design, standard eligibility criteria, and data from 20 reference studies. Three investigators rated the quality of the generated study designs and eligibility criteria on a 5-point Likert scale for accuracy, integrity, relevance, consistency, and overall performance. For the other tasks, an output was counted as accurate only if it matched the reference standard. Search strategy performance was evaluated through accuracy and retrieval efficacy. Screening accuracy was assessed for both abstract screening and full-text screening. Data extraction accuracy was evaluated across 1,120 data points comprising 3,360 individual fields. Claude-3 demonstrated superior overall performance in PICO design. In search strategy formulation, GPT-4 and Claude-3 achieved comparable accuracy, outperforming Mistral. For abstract screening, GPT-4 achieved the highest accuracy, followed by Mistral and Claude-3. In data extraction, GPT-4 significantly outperformed the other models. LLMs demonstrate potential for automating systematic review tasks, with GPT-4 showing superior performance in search strategy formulation, literature screening, and data extraction. These capabilities make them promising assistive tools for researchers and warrant further development and validation in this field.
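The screening and accuracy protocol described in the abstract reduces to a simple loop: prompt a model with the eligibility criteria and one abstract, request a binary include/exclude decision, and score exact agreement against the reference standard. The Python sketch below illustrates one plausible implementation of that loop; it is not the authors' code, and the `openai` client, the model string, the prompt wording, and the placeholder criteria are all assumptions made for illustration.

```python
# Minimal sketch of LLM-based abstract screening against PICO-derived
# eligibility criteria. Hypothetical: model name, prompt wording, and
# criteria text are illustrative assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder criteria; the study used criteria from a published review.
ELIGIBILITY_CRITERIA = """\
Population: adults with condition X
Intervention: treatment Y
Comparison: placebo or standard care
Outcome: primary clinical outcome Z
"""

def screen_abstract(abstract: str) -> str:
    """Ask the model for a binary include/exclude decision on one abstract."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study also tested Claude-3 and Mistral 8x7B
        messages=[
            {"role": "system",
             "content": "You are screening abstracts for a systematic review. "
                        "Answer with exactly INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Eligibility criteria:\n{ELIGIBILITY_CRITERIA}\n"
                        f"Abstract:\n{abstract}"},
        ],
        temperature=0,  # deterministic-leaning output for reproducibility
    )
    return (response.choices[0].message.content or "").strip()

def screening_accuracy(decisions: list[str], reference: list[str]) -> float:
    """Score as the paper defines accuracy: an output counts as accurate
    only if it matches the reference standard exactly."""
    matches = sum(d == r for d, r in zip(decisions, reference))
    return matches / len(reference)
```

The exact-match scorer mirrors the paper's accuracy definition for screening and extraction tasks, and temperature 0 is a common choice for keeping repeated screening runs as consistent as possible.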
Related papers
- Empirical evaluation of LLMs in predicting fixes of Configuration bugs in Smart Home System [0.0]
This study evaluates the effectiveness of Large Language Models (LLMs) in predicting fixes for configuration bugs in smart home systems. The research analyzes three prominent LLMs: GPT-4, GPT-4o (GPT-4 Turbo), and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-02-16T02:11:36Z) - Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) [0.5434005537854512]
This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS).
We compared the performance of four state-of-the-art LLMs in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting.
arXiv Detail & Related papers (2025-01-21T04:05:45Z) - ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery [23.773528748933934]
We extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
We unify the target output for every task to a self-contained Python program file.
We propose two effective strategies to mitigate data contamination concerns.
arXiv Detail & Related papers (2024-10-07T14:33:50Z) - AI based Multiagent Approach for Requirements Elicitation and Analysis [3.9422957660677476]
This study empirically investigates the effectiveness of utilizing Large Language Models (LLMs) to automate requirements analysis tasks.
We deployed four models, namely GPT-3.5, GPT-4 Omni, LLaMA3-70B, and Mixtral-8B, and conducted experiments to analyze requirements on four real-world projects.
Preliminary results indicate notable variations in task completion among the models.
arXiv Detail & Related papers (2024-08-18T07:23:12Z) - Zero-shot Generative Large Language Models for Systematic Review Screening Automation [55.403958106416574]
This study investigates the effectiveness of using zero-shot large language models for automatic screening.
We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold.
arXiv Detail & Related papers (2024-01-12T01:54:08Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages [0.0]
This study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction using a 'human-out-of-the-loop' approach.
GPT-4 had accuracy on par with human performance in most tasks, but results were skewed by chance agreement and dataset imbalance.
When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect'.
arXiv Detail & Related papers (2023-10-26T16:18:30Z) - ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z) - GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show not only that humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but also that these summaries do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.