Automated title and abstract screening for scoping reviews using the GPT-4 Large Language Model
- URL: http://arxiv.org/abs/2311.07918v1
- Date: Tue, 14 Nov 2023 05:30:43 GMT
- Title: Automated title and abstract screening for scoping reviews using the GPT-4 Large Language Model
- Authors: David Wilkins
- Abstract summary: GPTscreenR is a package for the R statistical programming language that uses the GPT-4 Large Language Model (LLM) to automatically screen sources.
In validation against consensus human reviewer decisions, GPTscreenR performed similarly to an alternative zero-shot technique, with a sensitivity of 71%, specificity of 89%, and overall accuracy of 84%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scoping reviews, a type of literature review, require intensive human effort
to screen large numbers of scholarly sources for their relevance to the review
objectives. This manuscript introduces GPTscreenR, a package for the R
statistical programming language that uses the GPT-4 Large Language Model (LLM)
to automatically screen sources. The package makes use of the chain-of-thought
technique with the goal of maximising performance on complex screening tasks.
In validation against consensus human reviewer decisions, GPTscreenR performed
similarly to an alternative zero-shot technique, with a sensitivity of 71%,
specificity of 89%, and overall accuracy of 84%. Neither method achieved
perfect accuracy nor human levels of intraobserver agreement. GPTscreenR
demonstrates the potential for LLMs to support scholarly work and provides a
user-friendly software framework that can be integrated into existing review
processes.
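The chain-of-thought screening workflow described above can be illustrated with a short sketch. GPTscreenR itself is an R package and its actual functions are not reproduced here; the snippet below assumes the OpenAI Python client, and the review objective, inclusion criteria, and screen_source() helper are hypothetical. It only shows the general idea of prompting GPT-4 to reason through each criterion before giving an include/exclude verdict.

```python
# Illustrative sketch only, not GPTscreenR's API (GPTscreenR is an R package).
# Assumes the OpenAI Python client; the review objective, criteria, and
# screen_source() helper below are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVIEW_OBJECTIVE = "Example scoping review objective"  # hypothetical
INCLUSION_CRITERIA = [
    "Criterion 1 (hypothetical)",
    "Criterion 2 (hypothetical)",
]

def screen_source(title: str, abstract: str) -> str:
    """Ask GPT-4 to reason through each criterion, then return INCLUDE or EXCLUDE."""
    prompt = (
        f"Review objective: {REVIEW_OBJECTIVE}\n"
        "Inclusion criteria:\n- " + "\n- ".join(INCLUSION_CRITERIA) + "\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Reason step by step about whether each criterion is met, then finish "
        "with a single line reading 'VERDICT: INCLUDE' or 'VERDICT: EXCLUDE'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep screening decisions as close to deterministic as possible
    )
    text = response.choices[0].message.content or ""
    return "INCLUDE" if "VERDICT: INCLUDE" in text.upper() else "EXCLUDE"
```

Verdicts produced this way can then be compared against consensus human decisions to compute the sensitivity, specificity, and overall accuracy figures of the kind reported in the abstract.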
Related papers
- A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study [1.0787328610467801]
Large Language Models (LLMs) have shown impressive performance on several new tasks without updating the model's parameters.
This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for extracting app features.
Results indicate the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in f1-score with zero-shot feature extraction.
arXiv Detail & Related papers (2024-09-11T10:21:13Z) - Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates.
The optimal input is the article title and abstract, with average ChatGPT scores based on these correlating at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z) - Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
We present a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization.
We start from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
arXiv Detail & Related papers (2024-03-20T17:42:08Z) - Enhancing Robustness of LLM-Synthetic Text Detectors for Academic
Writing: A Comprehensive Analysis [35.351782110161025]
Large language models (LLMs) offer numerous advantages in terms of revolutionizing work and study methods.
They have also garnered significant attention due to their potential negative consequences.
One example is generating academic reports or papers with little to no human contribution.
arXiv Detail & Related papers (2024-01-16T01:58:36Z) - Zero-shot Generative Large Language Models for Systematic Review
Screening Automation [55.403958106416574]
This study investigates the effectiveness of using zero-shot large language models for automatic screening.
We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold.
arXiv Detail & Related papers (2024-01-12T01:54:08Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Can large language models replace humans in the systematic review
process? Evaluating GPT-4's efficacy in screening and extracting data from
peer-reviewed and grey literature in multiple languages [0.0]
This study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction using a 'human-out-of-the-loop' approach.
GPT-4 had accuracy on par with human performance in most tasks, but results were skewed by chance agreement and dataset imbalance.
When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect'.
arXiv Detail & Related papers (2023-10-26T16:18:30Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - Split and Merge: Aligning Position Biases in Large Language Model based
Evaluators [23.38206418382832]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.