Zero-shot Generative Large Language Models for Systematic Review
Screening Automation
- URL: http://arxiv.org/abs/2401.06320v2
- Date: Thu, 1 Feb 2024 02:08:28 GMT
- Title: Zero-shot Generative Large Language Models for Systematic Review
Screening Automation
- Authors: Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan
Koopman, Guido Zuccon
- Abstract summary: This study investigates the effectiveness of using zero-shot large language models for automatic screening.
We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold.
- Score: 55.403958106416574
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Systematic reviews are crucial for evidence-based medicine as they
comprehensively analyse published research findings on specific questions.
Conducting such reviews is often resource- and time-intensive, especially in
the screening phase, where abstracts of publications are assessed for inclusion
in a review. This study investigates the effectiveness of using zero-shot large
language models (LLMs) for automatic screening. We evaluate the effectiveness
of eight different LLMs and investigate a calibration technique that uses a
predefined recall threshold to determine whether a publication should be
included in a systematic review. Our comprehensive evaluation using five
standard test collections shows that instruction fine-tuning plays an important
role in screening, that calibration renders LLMs practical for achieving a
targeted recall, and that combining both with an ensemble of zero-shot models
saves significant screening time compared to state-of-the-art approaches.
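The calibration technique described in the abstract, selecting a score threshold so that a predefined recall target is met, could be sketched as follows. This is a hypothetical illustration under assumed inputs (per-publication relevance scores and binary inclusion labels from a calibration set), not the authors' implementation; the function and parameter names are invented.

```python
import math

def calibrate_threshold(scores, labels, target_recall=0.95):
    """Return the largest score threshold whose recall on the
    calibration set meets or exceeds target_recall."""
    # Sort the scores of truly relevant documents in descending order;
    # the threshold must admit at least ceil(target_recall * n_relevant)
    # of them to reach the target recall.
    rel_scores = sorted((s for s, y in zip(scores, labels) if y == 1),
                        reverse=True)
    if not rel_scores:
        raise ValueError("calibration set has no relevant documents")
    k = math.ceil(target_recall * len(rel_scores))
    return rel_scores[k - 1]

def screen(scores, threshold):
    """Include a publication iff its score reaches the threshold."""
    return [s >= threshold for s in scores]
```

For example, with relevant documents scored 0.9, 0.8, and 0.7, a target recall of 1.0 would yield a threshold of 0.7, admitting all three.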
Related papers
- Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews [7.355182982314533]
Large Language Models (LLMs) can be used to enhance the efficiency, speed, and precision of literature review filtering.
We show that using advanced LLMs with simple prompting can significantly reduce the time required for literature filtering.
We also show that false negatives can indeed be controlled through a consensus scheme, achieving recall above 98.8%, at or even beyond the typical human error threshold.
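One simple form such a consensus scheme could take, sketched here as an assumption rather than the cited paper's actual method, is to exclude a publication only when every model in the ensemble votes to exclude it, trading some precision for fewer false negatives.

```python
# Hypothetical consensus screening: a paper is excluded only if all
# models agree to exclude it, which minimizes false negatives.

def consensus_include(votes):
    """votes: per-model boolean include decisions for one paper."""
    return any(votes)

def screen_corpus(model_votes):
    """model_votes: one list of per-model votes per paper."""
    return [consensus_include(v) for v in model_votes]
```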
arXiv Detail & Related papers (2024-07-15T12:13:53Z)
- Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews.
Overall, results indicated an accuracy of around 80%, with some variability between domains.
arXiv Detail & Related papers (2024-05-23T11:24:23Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- A Survey of the Impact of Self-Supervised Pretraining for Diagnostic Tasks with Radiological Images [71.26717896083433]
Self-supervised pretraining has been observed to be effective at improving feature representations for transfer learning.
This review summarizes recent research into its usage in X-ray, computed tomography, magnetic resonance, and ultrasound imaging.
arXiv Detail & Related papers (2023-09-05T19:45:09Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- A Thorough Examination on Zero-shot Dense Retrieval [84.70868940598143]
We present the first thorough examination of the zero-shot capability of dense retrieval (DR) models.
We discuss the effect of several key factors related to source training set, analyze the potential bias from the target dataset, and review and compare existing zero-shot DR models.
arXiv Detail & Related papers (2022-04-27T07:59:07Z)
- Best Practices and Scoring System on Reviewing A.I. based Medical Imaging Papers: Part 1 Classification [0.9428556282541211]
The Machine Learning Education Sub-Committee of SIIM has identified a knowledge gap and a serious need to establish guidelines for reviewing these studies.
This first entry in the series focuses on the task of image classification.
The goal of this series is to provide resources to help improve the review process for A.I.-based medical imaging papers.
arXiv Detail & Related papers (2022-02-03T21:46:59Z)
- Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews [18.33687903724145]
Well-done systematic reviews are expensive, time-demanding, and labor-intensive.
We propose an automatic document classification approach to significantly reduce the effort in reviewing documents.
arXiv Detail & Related papers (2020-12-09T22:45:40Z)
- An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification [2.985964157078619]
Automatic analysis of diseases in the GI tract is an active research topic in computer science and in medical journals.
A clear understanding of evaluation metrics and machine learning models with cross datasets is crucial to bring research in the field to a new quality level.
We present comprehensive evaluations of five distinct machine learning models that can classify 16 different GI tract conditions.
arXiv Detail & Related papers (2020-05-08T08:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.