Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study
- URL: http://arxiv.org/abs/2405.14445v1
- Date: Thu, 23 May 2024 11:24:23 GMT
- Title: Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study
- Authors: Lena Schmidt, Kaitlyn Hair, Sergio Graziosi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, James Thomas
- Abstract summary: This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews.
Overall, results indicated an accuracy of around 80%, with some variability between domains.
- Score: 0.28318468414401093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs, there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. The first aimed to automatically extract study characteristics from human clinical, animal, and social science domain studies; we used two studies from each category for prompt development and ten for evaluation. In the second, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), while outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data processed by the LLM.
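To make the extraction-and-evaluation workflow concrete, the following is a minimal sketch, not the authors' code, of how GPT-4 could be prompted to extract PICO elements from an abstract and how per-field accuracy could be tallied. It assumes the OpenAI Python client (openai>=1.0); the prompt wording, the record format, and the `judge` callable are hypothetical stand-ins for the manual correctness judgement used in the paper (BLEU/ROUGE are omitted since the authors found them of limited value).

```python
# Minimal sketch, not the authors' code: prompt GPT-4 for PICO elements and
# tally simple per-field accuracy against gold labels. Assumes the OpenAI
# Python client (openai>=1.0) and a hypothetical EBM-NLP-style record format.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Extract the Participants, Interventions/Controls and Outcomes from the "
    "abstract below. Answer with a JSON object using the keys "
    '"participants", "interventions_controls" and "outcomes".\n\n'
    "Abstract:\n{abstract}"
)


def extract_pico(abstract: str, model: str = "gpt-4") -> dict:
    """Ask the LLM for PICO elements and parse its JSON reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduces, but does not remove, run-to-run variability
        messages=[{"role": "user", "content": PROMPT.format(abstract=abstract)}],
    )
    # Simplification: assumes the reply is valid JSON; real runs need guarding.
    return json.loads(response.choices[0].message.content)


def per_field_accuracy(records: list[dict], judge) -> dict:
    """records: [{"abstract": ..., "labels": {field: gold_text, ...}}, ...]
    judge: callable(predicted, gold) -> bool, standing in for the manual
    correctness judgement used in the paper.
    Returns the fraction of correct extractions per PICO field."""
    totals: dict = {}
    correct: dict = {}
    for record in records:
        prediction = extract_pico(record["abstract"])
        for field, gold in record["labels"].items():
            totals[field] = totals.get(field, 0) + 1
            correct[field] = correct.get(field, 0) + int(judge(prediction.get(field, ""), gold))
    return {field: correct[field] / totals[field] for field in totals}
```

In this sketch the `judge` callable is where a human reviewer would plug in; automating that judgement reliably is exactly the part the paper flags as needing further research on stability and reliability.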
Related papers
- On the Statistical Significance with Relevance Assessments of Large Language Models [2.9180406633632523]
We use Large Language Models to label the relevance of documents when building new retrieval test collections.
Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives.
Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements.
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis [7.059964549363294]
This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs).
Our research introduces a novel approach that fine-tunes the LLM on extensive scientific datasets to address challenges in big data handling and structured data extraction.
arXiv Detail & Related papers (2024-11-16T20:18:57Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review [42.112100361891905]
This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review.
We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field.
arXiv Detail & Related papers (2024-09-06T20:12:57Z)
- LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization [9.364214238045317]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks.
In this study, we investigate whether LLMs can evaluate bug report summarization effectively.
arXiv Detail & Related papers (2024-09-01T06:30:39Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models [19.72316842477808]
We evaluate whether modern large language models (LLMs) can reliably perform this task.
Massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis.
arXiv Detail & Related papers (2024-05-02T19:20:11Z)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)