Zero-shot Generative Large Language Models for Systematic Review
Screening Automation
- URL: http://arxiv.org/abs/2401.06320v2
- Date: Thu, 1 Feb 2024 02:08:28 GMT
- Title: Zero-shot Generative Large Language Models for Systematic Review
Screening Automation
- Authors: Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan
Koopman, Guido Zuccon
- Abstract summary: This study investigates the effectiveness of using zero-shot large language models for automatic screening.
We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold.
- Score: 55.403958106416574
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Systematic reviews are crucial for evidence-based medicine as they
comprehensively analyse published research findings on specific questions.
Conducting such reviews is often resource- and time-intensive, especially in
the screening phase, where abstracts of publications are assessed for inclusion
in a review. This study investigates the effectiveness of using zero-shot large
language models (LLMs) for automatic screening. We evaluate the effectiveness
of eight different LLMs and investigate a calibration technique that uses a
predefined recall threshold to determine whether a publication should be
included in a systematic review. Our comprehensive evaluation using five
standard test collections shows that instruction fine-tuning plays an important
role in screening, that calibration renders LLMs practical for achieving a
targeted recall, and that combining both with an ensemble of zero-shot models
saves significant screening time compared to state-of-the-art approaches.
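To make the calibration idea concrete, the following Python sketch shows one way such a pipeline could look; it is an illustration under stated assumptions, not the authors' implementation. The hypothetical `score_with_llm` callable stands in for any zero-shot LLM scorer that maps a review question and an abstract to a score in [0, 1] (for example, the probability the model assigns to an "include" answer); a threshold is then fitted on a small labelled sample so that a chosen target recall (0.95 here, purely as an example) is reached before screening the remaining abstracts.

```python
import math
from typing import Callable, Sequence


def calibrate_threshold(
    scores: Sequence[float],
    labels: Sequence[int],          # 1 = human-included, 0 = human-excluded
    target_recall: float = 0.95,
) -> float:
    """Choose the highest score threshold that still retains `target_recall`
    of the human-included abstracts in a small calibration sample."""
    positive_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positive_scores:
        return 0.0
    # Number of included abstracts that must stay above the threshold.
    keep = max(1, math.ceil(target_recall * len(positive_scores)))
    return positive_scores[keep - 1]


def screen(
    question: str,
    abstracts: Sequence[str],
    score_with_llm: Callable[[str, str], float],  # hypothetical zero-shot scorer
    threshold: float,
) -> list[bool]:
    """Zero-shot screening: keep an abstract if its LLM score clears the
    calibrated threshold; everything below it can be excluded."""
    return [score_with_llm(question, abstract) >= threshold for abstract in abstracts]
```

For an ensemble of zero-shot models, the per-model scores could simply be averaged before applying the same calibration step; the paper's exact prompting and calibration details are given in the full text.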
Related papers
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions [20.51842378080194]
Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers.
A significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments.
We systematically review 47 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated.
arXiv Detail & Related papers (2024-10-24T17:58:22Z)
- Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making [1.3812010983144802]
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance.
The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z)
- Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews.
Overall, results indicated an accuracy of around 80%, with some variability between domains.
arXiv Detail & Related papers (2024-05-23T11:24:23Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and effective at triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
However, they still struggle with accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and to check its own outputs (a minimal sketch of this pattern appears after this list).
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations [22.668383945059762]
We present a systematic evaluation of four representative Large Language Models (LLMs) across 12 BioNLP datasets.
The evaluation is conducted under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning.
We compare these models against state-of-the-art (SOTA) approaches that fine-tune (domain-specific) BERT or BART models.
arXiv Detail & Related papers (2023-05-10T13:40:06Z)
- Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews [18.33687903724145]
Well-done systematic reviews are expensive, time-demanding, and labor-intensive.
We propose an automatic document classification approach to significantly reduce the effort in reviewing documents.
arXiv Detail & Related papers (2020-12-09T22:45:40Z)
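Below is the minimal sketch of the extract-then-verify pattern referenced in the Self-Verification entry above. The generic `llm` callable, the prompts, and the `extract_with_self_verification` helper are hypothetical illustrations, not the paper's actual prompts or API: the model first extracts a value and is then prompted again to quote the evidence from the source text that supports it.

```python
from typing import Callable


def extract_with_self_verification(
    llm: Callable[[str], str],  # hypothetical: prompt string in, completion string out
    note: str,
    field: str,
) -> dict:
    """Extract a field from a clinical note, then ask the model to verify its
    own answer by quoting supporting evidence (provenance) from the note."""
    answer = llm(
        f"Extract the {field} from the note below. Reply with the value only.\n\n"
        f"Note:\n{note}"
    )
    verification = llm(
        f"You previously extracted {field} = {answer!r} from the note below. "
        f"Quote the exact sentence that supports this value, or reply "
        f"UNSUPPORTED if nothing in the note supports it.\n\nNote:\n{note}"
    )
    supported = "UNSUPPORTED" not in verification.upper()
    return {"value": answer, "evidence": verification, "supported": supported}
```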
This list is automatically generated from the titles and abstracts of the papers on this site.