Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study
- URL: http://arxiv.org/abs/2405.14445v1
- Date: Thu, 23 May 2024 11:24:23 GMT
- Title: Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study
- Authors: Lena Schmidt, Kaitlyn Hair, Sergio Graziosi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, James Thomas
- Abstract summary: This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews.
Overall, results indicated an accuracy of around 80%, with some variability between domains.
- Score: 0.28318468414401093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs, there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. The first aimed to automatically extract study characteristics from human clinical, animal, and social science domain studies; we used two studies from each category for prompt development and ten for evaluation. In the second, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), while outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data processed by the LLM.
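To make the extraction-and-evaluation workflow concrete, the following is a minimal sketch, not the authors' code, of how GPT-4 could be prompted to extract PICO elements from an abstract and how per-field accuracy could be tallied. It assumes the OpenAI Python client (openai>=1.0); the prompt wording, the record format, and the `judge` callable are hypothetical stand-ins for the manual correctness judgement used in the paper (BLEU/ROUGE are omitted since the authors found them of limited value).

```python
# Minimal sketch, not the authors' code: prompt GPT-4 for PICO elements and
# tally simple per-field accuracy against gold labels. Assumes the OpenAI
# Python client (openai>=1.0) and a hypothetical EBM-NLP-style record format.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Extract the Participants, Interventions/Controls and Outcomes from the "
    "abstract below. Answer with a JSON object using the keys "
    '"participants", "interventions_controls" and "outcomes".\n\n'
    "Abstract:\n{abstract}"
)


def extract_pico(abstract: str, model: str = "gpt-4") -> dict:
    """Ask the LLM for PICO elements and parse its JSON reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduces, but does not remove, run-to-run variability
        messages=[{"role": "user", "content": PROMPT.format(abstract=abstract)}],
    )
    # Simplification: assumes the reply is valid JSON; real runs need guarding.
    return json.loads(response.choices[0].message.content)


def per_field_accuracy(records: list[dict], judge) -> dict:
    """records: [{"abstract": ..., "labels": {field: gold_text, ...}}, ...]
    judge: callable(predicted, gold) -> bool, standing in for the manual
    correctness judgement used in the paper.
    Returns the fraction of correct extractions per PICO field."""
    totals: dict = {}
    correct: dict = {}
    for record in records:
        prediction = extract_pico(record["abstract"])
        for field, gold in record["labels"].items():
            totals[field] = totals.get(field, 0) + 1
            correct[field] = correct.get(field, 0) + int(judge(prediction.get(field, ""), gold))
    return {field: correct[field] / totals[field] for field in totals}
```

In this sketch the `judge` callable is where a human reviewer would plug in; automating that judgement reliably is exactly the part the paper flags as needing further research on stability and reliability.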
Related papers
- On the Statistical Significance with Relevance Assessments of Large Language Models [2.9180406633632523]
We use Large Language Models to label the relevance of documents when building new retrieval test collections.
Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives.
Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements.
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis [7.059964549363294]
This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs).
Our research introduces a novel approach that fine-tunes the LLM on extensive scientific datasets to address challenges in big data handling and structured data extraction.
arXiv Detail & Related papers (2024-11-16T20:18:57Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review [42.112100361891905]
This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review.
We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field.
arXiv Detail & Related papers (2024-09-06T20:12:57Z)
- LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization [9.364214238045317]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks.
In this study, we investigate whether LLMs can evaluate bug report summarization effectively.
arXiv Detail & Related papers (2024-09-01T06:30:39Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models [19.72316842477808]
We evaluate whether modern large language models (LLMs) can reliably perform this task.
Massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis.
arXiv Detail & Related papers (2024-05-02T19:20:11Z)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)